Half the equation: open standards are a first step toward speech automation.


Today, well-engineered speech recognition systems achieve high customer satisfaction and high returns on investment in many customer service areas, including stock trading, flight information, catalog ordering and directory assistance. Although speech automation's potential has become widely recognized, few IT organizations have had the means to build or maintain speech systems, relying instead on expensive services from speech engine vendors or specialist system integrators. One major impediment to speech development efforts was removed when the industry adopted open standards and Web technologies familiar to mainstream IT organizations. However, a larger obstacle still remains: speech development methodologies and tools must improve to address the unique demands of voice user interfaces before mainstream enterprises can reliably deliver high quality speech systems at a reasonable cost.

**********

The First Step: Open Speech Standards

The earliest development approaches required programming in the application programming interface (API) specific to each speech recognition engine. This approach burdened developers with low-level, recognition engine-specific details such as exception handling and resource management. Moreover, the proprietary nature of these APIs restricted the flexibility with which enterprises could deploy applications. Most software components had to be sourced from a single vendor and had to be deployed in a single location, and the resulting applications could not be easily ported to other platforms.

The advent of voice languages such as VoiceXML and SALT contributed to a Web-based development process. These languages allow a distribution of responsibilities in a speech system between a voice browser, which performs the speech recognition function, and a server application, which contains the application logic and user interface behavior (expressed in the voice language). As a result, application developers no longer concern themselves with speech engine API calls, but instead are responsible for generating documents that can be executed by the voice browser.
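As a rough illustration of that division of labor, the sketch below shows a server-side function that generates a small VoiceXML 2.0 document for a voice browser to execute. The function name, the form and field names, and the two URLs are hypothetical and chosen only for this example; nothing here is taken from a specific product.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_confirmation_page(grammar_url: str, submit_url: str) -> str:
    """Return a VoiceXML 2.0 page that asks for one confirmation number."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "reconfirm"})

    # One field holds one question and the answer the browser recognizes.
    field = ET.SubElement(form, "field", {"name": "confirmation_number"})
    ET.SubElement(field, "prompt").text = "Do you have your confirmation number?"
    # The grammar (hypothetical URL) tells the browser what to listen for.
    ET.SubElement(
        field, "grammar", {"src": grammar_url, "type": "application/srgs+xml"}
    )

    # Once the field is filled, the browser sends the value back to the server.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(
        filled, "submit", {"next": submit_url, "namelist": "confirmation_number"}
    )

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_confirmation_page("/grammars/confirmation.grxml", "/reconfirm"))
```

The server application never calls a recognition engine directly; it only emits markup, and the voice browser handles recognition and posts the result back.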

VoiceXML (Voice Extensible Markup Language) is a standard endorsed by the World Wide Web Consortium (W3C) for speech application development. The first specification was released in March 2000 by the VoiceXML Forum (www.voicexml.org/), an industry body that now has 375 member companies, including IBM, Nuance, Motorola and AT&T. The latest version, VoiceXML 2.0, became a W3C recommendation in March 2004. VoiceXML voice browsers are already available through dozens of vendors; in all, a hundred or so vendors provide compliant products. Commercial VoiceXML deployments have been estimated in the thousands.

SALT is a newer standard, proposed by the SALT Forum (www.saltforum.org/), and is somewhat competitive with VoiceXML. The intent of SALT is to facilitate multimodal applications, allowing spoken interfaces to be used in conjunction with a keyboard and a display screen, so that Web pages can be accessed by different client devices. However, SALT can also be used to build voice-only applications, and one of its targets is to simplify speech application development. The major proponent of SALT is Microsoft, but many companies support both SALT and VoiceXML, including Intel, Cisco, HP and ScanSoft. Only a few SALT voice browsers are currently available. The most prominent is Microsoft's Speech Server, which has attracted developer interest due to its integration with Microsoft's .NET framework. To date, SALT has few publicly announced commercial deployments.

VoiceXML is a larger language that contains its own procedural and transport elements. In contrast, SALT is a lightweight extension to existing markup languages, most notably HTML and XHTML. SALT tags are embedded within the HTML DOM (document object model) event and scripting environment, a model familiar to Web developers. Dialog flow is managed by combining SALT elements with DOM object properties, methods and events. This programming approach is well-suited to multimodal applications because visual and speech elements on a Web document are peers. VoiceXML, on the other hand, has constructs designed specifically for speech-only interfaces, such as dialogs with predefined execution flows.

Despite the competition, SALT supports various W3C standards associated with the VoiceXML standard, including SRGS, the W3C speech recognition grammar specification; SSML, the W3C language for controlling TTS (text-to-speech) pronunciation, emphasis and intonation; and ECMAScript, the scripting language specification. Moreover, SALT has been submitted to the W3C's Voice Browser working group, and some of its concepts may be incorporated into the next VoiceXML standard.

VoiceXML and SALT are both presentation layer languages that deliver a number of benefits. First, they are associated with a Web development model familiar to most programmers. Second, they support flexible deployment architectures--the voice browser and server application can be co-located or separated, and can be managed by the same or different entities. Third, they offer the prospect of application portability across different vendor platforms.

Much More Is Needed For High Usability

Despite these benefits, developing speech applications remains a complex undertaking. Industry estimates for delivering a customer-facing speech application of moderate complexity range from 3,000 to 6,000 person hours (including requirements analysis, dialog design, coding, source system integration, audio processing, testing and tuning), and first-time efforts can be considerably longer.

Building a highly usable speech system with existing VoiceXML and SALT tools is costly, slow and difficult. Most tools implement a development model similar to that used for creating a workflow application or a touch-tone menu tree. The developer is provided a palette of dialog components and a canvas on which these components can be sequenced with some transitional logic. The dialog components encapsulate all of the prompts, grammars and presentation code (VoiceXML or SALT) required to collect a particular type of data item, such as a date, dollar amount or credit card number.
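A minimal sketch of such a dialog component might look like the following. The class name, its fields and the toy yes/no parser are assumptions made for illustration; a real component would wrap a speech grammar and generated VoiceXML or SALT rather than plain string matching.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DialogComponent:
    """Hypothetical component: everything needed to collect ONE data item."""

    name: str
    prompt: str
    reprompt: str
    parse: Callable[[str], Optional[str]]  # stands in for a speech grammar

    def ask(self, utterance: str) -> Optional[str]:
        """Return the parsed value, or None if the utterance is out of grammar."""
        return self.parse(utterance.strip().lower())


def parse_yes_no(text: str) -> Optional[str]:
    """Toy grammar that accepts only a bare yes or no."""
    if text in {"yes", "yeah", "yep"}:
        return "yes"
    if text in {"no", "nope"}:
        return "no"
    return None


has_number = DialogComponent(
    name="has_confirmation_number",
    prompt="Do you have your confirmation number?",
    reprompt="Sorry, do you have your confirmation number? Please say yes or no.",
    parse=parse_yes_no,
)
```

With this design, `has_number.ask("Yes")` yields a value, while any utterance that carries more than the single expected item is rejected and triggers the reprompt.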

Unfortunately, dialog components are usually too atomic--they process a single question and answer containing a single data item. To implement an application of any sophistication, the developer has to manually write new components to handle more complex responses (such as user utterances that contain multiple pieces of information), as well as code the logic for any "off-topic" response; that is, a response that does not directly answer the question posed. For example, consider the following conversation whereby a caller attempts to reconfirm his or her flight details with a human agent:

Agent: Do you have your confirmation number?

Caller: Um, no, but I'm flying out of Dallas on Friday.

[The caller does not provide the confirmation number as requested, but rather gives some details about the flight.]

Agent: OK, departing from Dallas. Are you leaving on Friday, January 28th or Friday, February 4th?

[The agent passively confirms the recognized departure airport and then attempts to clarify the actual departure date.]

Caller: I think my wife made the reservation for the fourth.

Agent: OK, Friday, February 4th. And around what time is the flight?

[The agent realizes that the date alone is not sufficient to retrieve the reservation and asks for the approximate time.]

Caller: 10:30

Agent: Is that a.m. or p.m.?

[The caller response is incomplete, so the agent asks a follow-up question.]

The above example illustrates that user responses in a speech application are much more varied and less structured than in a visual application. Callers may respond in many different ways due to differences in their objectives, the information they have at hand, their level of understanding and their interaction style. To achieve high usability, a speech application must be able to guide callers toward a desired outcome while allowing them latitude in their responses, such as the following elements:

* Callers may provide information in an arbitrary order of their own choosing;

* Callers may use superfluous words in their responses;

* Callers may provide multiple pieces of information in a single spoken utterance;

* Callers may provide--in a single utterance--only a subset of information requested by the application;

* Callers may clarify or correct the application's interpretation of information they have provided; and

* Callers may modify earlier responses in subsequent utterances.
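Several of these behaviors can be sketched with a simple form that is filled from whatever the caller happens to say. The slot names and the keyword patterns below are hypothetical stand-ins for real speech grammars; the point is only that one utterance may fill several slots, in any order, and that later utterances may overwrite earlier values.

```python
import re
from typing import Dict, Optional

# Toy keyword patterns; a real system would use speech grammars instead.
SLOT_PATTERNS = {
    "departure_city": re.compile(r"\b(?:out of|from)\s+(dallas|boston|chicago)\b"),
    "departure_day": re.compile(
        r"\bon\s+(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b"
    ),
    "departure_time": re.compile(r"\b(\d{1,2}:\d{2})\b"),
}


def extract_slots(utterance: str) -> Dict[str, str]:
    """Pull every recognizable item out of one utterance, ignoring filler words."""
    text = utterance.lower()
    found = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[slot] = match.group(1)
    return found


def update_form(form: Dict[str, str], utterance: str) -> Dict[str, str]:
    """Merge new values into the form; later values overwrite earlier ones."""
    merged = dict(form)
    merged.update(extract_slots(utterance))
    return merged


def next_missing(form: Dict[str, str]) -> Optional[str]:
    """Return the first slot that still needs a value, or None when complete."""
    for slot in SLOT_PATTERNS:
        if slot not in form:
            return slot
    return None


form = update_form({}, "Um, no, but I'm flying out of Dallas on Friday.")
```

After that single utterance the form holds both the departure city and the day, and `next_missing(form)` points the application at the departure time as the next question to ask.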

Speech applications present a new user interaction model--one significantly distinct from the graphical user interface (GUI) model well known to all computer users. A voice user interface (VUI) requires specialized design and implementation expertise. An effective interface is critical for success in any speech application and call center system. Inexperienced callers must find the VUI intuitive. The VUI should employ natural and flexible strategies to accept information and to guide callers along the call. It should collect information in a fast and efficient manner by avoiding repetitive or lengthy prompts.

For any customer service call, there might be a straightforward path the developer hopes callers will take. In reality, there are a multitude of different paths callers will actually take, because callers have different goals, different information at hand, different levels of comprehension, or different interaction styles. At each point in the conversation, the caller may answer the current question, or may stray from the direct path by reviewing previous responses, starting another train of thought or jumping to another part of the application. As a result, the richer the desired user experience, the more paths the developer must provide.

Current development tools facilitate the construction of a call path, but still require each path to be manually designed and configured. This approach is not practical for anything more than the simplest interactions, as the number of paths quickly becomes unmanageable. Furthermore, to improve usability, the developer must add, alter or remove paths by hand, which is untenable from a maintenance perspective.
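One simplified way to see how quickly the paths multiply is to count only the ways a caller can supply n required items, assuming the items may arrive in any order and any number of them may be bundled into a single utterance. This counting model is an assumption made for illustration; it ignores corrections, clarifications and off-topic responses, all of which would add further paths.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def answer_sequences(n: int) -> int:
    """Count the ways a caller can supply n items over one or more utterances.

    The first utterance carries k of the n items (any k >= 1, any choice of
    items); the remaining n - k items are then supplied in the same manner.
    """
    if n == 0:
        return 1
    return sum(comb(n, k) * answer_sequences(n - k) for k in range(1, n + 1))


if __name__ == "__main__":
    for items in range(1, 6):
        print(items, answer_sequences(items))
```

Under this model, two items can be supplied in 3 ways, three items in 13, four in 75 and five in 541, so even a short form produces far more paths than a developer can reasonably draw by hand.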

Changing The Equation: A New Approach To Speech Development

A better approach is to drive application development at the conversation level, which shields the programmer from the complexity of designing and implementing every possible call path. In this approach, the development tool would provide a set of services that model the conversation skills commonly encountered in customer service calls, and would construct the call paths accordingly.

For example, a conversation skill is disambiguation, which is the act of determining a single interpretation among two or more plausible interpretations derived from the caller's response. Using current tools, disambiguation would be manually implemented by inserting after each existing dialog an additional dialog that asks the caller to select one value among a set of ambiguous results. By contrast, a tool that understands the concept of disambiguation could automatically generate the disambiguation call path whenever multiple interpretations arise. A more complex conversation skill is goal-seeking behavior, the ability to process the caller's response in the context of the objectives of the conversation. In the previous flight reconfirmation example, this skill allowed the agent to understand the caller's departure airport and date even though the question asked was actually a request for a confirmation number. A development tool that is aware of goal-seeking behavior could automatically construct the numerous possible call paths when preconfigured with an objective, such as obtaining a flight itinerary.
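To make the disambiguation idea concrete, here is a minimal sketch of a reusable skill that builds the clarifying question whenever a response has more than one plausible interpretation. The function names are hypothetical, and the two-week window for resolving a bare weekday is an assumption; this is not a description of any particular tool.

```python
from datetime import date, timedelta
from typing import List, Optional

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
]


def upcoming_dates(weekday: str, today: date, weeks: int = 2) -> List[date]:
    """Return the next `weeks` calendar dates that fall on the named weekday."""
    offset = (WEEKDAYS.index(weekday.lower()) - today.weekday()) % 7
    first = today + timedelta(days=offset)
    return [first + timedelta(weeks=i) for i in range(weeks)]


def disambiguation_prompt(label: str, candidates: List[date]) -> Optional[str]:
    """Build a clarifying question only when more than one candidate remains."""
    if len(candidates) < 2:
        return None
    options = " or ".join(f"{d.strftime('%A, %B')} {d.day}" for d in candidates)
    return f"Are you {label} on {options}?"


# "Friday", heard on Wednesday, January 26, 2005, could mean either of two dates.
candidates = upcoming_dates("friday", date(2005, 1, 26))
question = disambiguation_prompt("leaving", candidates)
```

Because the question is derived from the candidate list rather than written by hand, the same skill can be reused for any slot whose value comes back ambiguous.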

By recognizing and codifying these and many other common conversation skills, a speech development tool would allow developers to implement rich and natural conversations with minimal effort. This approach achieves great savings in development cost and complexity for demanding customer-facing systems.

Open standards such as VoiceXML and SALT are necessary components for the mainstream adoption of speech automation systems. These standards offer a Web-based development model that is already familiar to IT organizations. However, they are not sufficient. Current speech development tools still leave too much of the hard work to the developer: conversation skills and other elements of the voice interface paradigm, such as goal-seeking behavior, flexible recognition, navigation, clarification and correction, must be reinvented and implemented for every speech system. Given the relative newness of the speech paradigm, these requirements can prove overwhelming to the developer. Speech tools and platforms will have to better facilitate the implementation of high usability capabilities before enterprises can consistently deliver high-quality customer service through their speech systems.

by Patrick Nguyen

Voxify

Patrick Nguyen is the chief technology officer of Voxify, which creates automated agents with the ability to handle advanced customer service calls for call centers. He began his software development career at Australia's Telstra Research Labs. Patrick has also worked for McKinsey & Company, and he has an MBA from MIT's Sloan School and a B.S. in Electrical Engineering from the University of Melbourne.
COPYRIGHT 2005 Technology Marketing Corporation
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2005, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:SPEECH-WORLD[TM]
Author:Nguyen, Patrick
Publication:Customer Interaction Solutions
Geographic Code:1USA
Date:Mar 1, 2005
Words:2044