Beyond SALT versus VoiceXML: coping with the wealth of standards in speech and multimodal self-service applications. (Call Center/CRM Management Scope).

K. W. (Bill) Scholz

Standards serve as the foundation for growth within an industry. As a new technology is spawned and begins to pique the interest of developers and consumers, its initial growth is typically haphazard and devoid of structure. As the technology reaches adolescence, however, its leaders develop standards that guide growth and interoperability, and its haphazard evolution fades.

As the technologies enabling speech and multimodal self-service applications mature, many standards have emerged and combined to enable the field to approach mainstream status. The growth of standards is not without its cost, however: because of the complexity of the underlying technologies, the standards documents themselves have grown to span thousands of pages, and as a consequence they present an overwhelming obstacle to a developer's mastery of the technology. Furthermore, this past year has seen considerable press devoted to the so-called "conflict" between the two key standards in our industry: SALT and VoiceXML. Claims of conflict have deluded some developers into feeling pressure to make premature "choices" between them, while intimidating others into inactivity as they wait for the industry to choose the "right" one.

In fact, there are over a dozen distinct standards designed to guide the development and execution of speech and multimodal applications, occasionally competing with one another but more frequently operating in harmony to guide distinct components of the application's architecture.

Deployment Architecture

Figure 1 (on page 54) illustrates the deployment architecture for a speech or multimodal application. The major components in the architecture and their functions are as follows:

Application server.
The central component is the application server, the platform and software responsible for managing the execution of the application. The application server's principal responsibilities include management of the dialog with the end user and management of the business transaction processor, which provides the application's business functionality.

Business transaction processor. This term describes the software and (optionally) the platform responsible for execution of the business transactions (for example, a travel reservation system, a retail banking database, a regional or national weather repository, or a securities transaction database).

Voice gateway. During execution, the application server exchanges information with the voice gateway; this information is coded in a markup language and conveyed using the familiar Internet delivery paradigm. The voice gateway includes:

* A markup language interpreter,
* An automatic speech recognizer (ASR),
* A text-to-speech (TTS) generator, and
* A telephone network interface (tele interface).

The tele interface mediates the connection through the circuit-switched or packet-switched telephone network to the end user. The network connection will use either a direct digital interface to the circuit-switched network or voice-over-IP (VoIP) through a media gateway to the telephone network.

Voice user interface. This is an end user interface using speech over wireless or wireline telephones.

Graphics user interface. This is an end user interface using desktop PCs, PDAs, cell phones with digital visual displays, or other screen-oriented devices.
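To make the exchange between the application server and the voice gateway concrete, the following is a minimal sketch in Python, not a description of any particular product. It assumes the gateway fetches VoiceXML 2.0 pages over HTTP, and it models the application server as a bare WSGI callable. The names render_state and application, the "destination" dialog state, the prompt wording and the city list are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by VoiceXML 2.0 documents.
VXML_NS = "http://www.w3.org/2001/vxml"


def render_state(name, prompt, responses):
    """Build a one-field VoiceXML document for a single dialog state.

    name, prompt and responses are hypothetical inputs: the state's
    identifier, the text to speak, and the phrases the caller may say.
    """
    ET.register_namespace("", VXML_NS)  # serialize VoiceXML as the default namespace
    vxml = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.0"})
    form = ET.SubElement(vxml, f"{{{VXML_NS}}}form", {"id": name})
    field = ET.SubElement(form, f"{{{VXML_NS}}}field", {"name": name})
    ET.SubElement(field, f"{{{VXML_NS}}}prompt").text = prompt
    for phrase in responses:
        # Each <option> contributes one accepted phrase to the field's grammar.
        ET.SubElement(field, f"{{{VXML_NS}}}option").text = phrase
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(vxml, encoding="utf-8")


def application(environ, start_response):
    """WSGI entry point standing in for the application server.

    A voice gateway acting as an HTTP client would request this page and
    hand the returned markup to its markup language interpreter.
    """
    body = render_state(
        "destination",
        "Which city are you flying to?",
        ["Boston", "Chicago", "Denver"],
    )
    headers = [
        ("Content-Type", "application/voicexml+xml"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

In this sketch the speech recognizer, text-to-speech generator and telephone interface never appear in application code: the application server only emits markup, and the gateway's interpreter decides how to speak the prompt and listen for the listed phrases.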
Standards

The principal standards and standardized APIs (application program interfaces) that guide the operation and interaction of the components in the architecture are shown in Figure 1, and are listed and described below. The agency responsible for each standard or API is shown in parentheses after the standard's name.

CCXML (W3C). Call Control eXtensible Markup Language is designed to provide telephony call control support for dialog systems. CCXML is intended to serve as an adjunct language for use with a VoiceXML, SALT or other dialog implementation platform.

HTTP (IETF). Hypertext Transfer Protocol is an application-level protocol for distributed, collaborative, hypermedia information systems. It is a generic, stateless protocol which can be used for many tasks beyond its use for hypertext, such as name servers and distributed object management systems, through extension of its request methods, error codes and headers.

H.323 (ITU). H.323 is a standard that specifies the components, protocols and procedures that provide multimedia communication services (real-time audio, video and data communications) over packet networks, including Internet protocol (IP)-based networks.
H.323 is part of a family of recommendations that provide multimedia communication services over a variety of networks.

JDBC (Sun Microsystems). Java Database Connectivity is an API that lets developers access virtually any tabular data source from the Java programming language. It provides cross-DBMS connectivity to a wide range of SQL databases and also provides access to other tabular data sources, such as spreadsheets or flat files.

ODBC (Microsoft). Open Database Connectivity is a widely accepted API for database access. It is based on the Call-Level Interface (CLI) specifications from X/Open and ISO/IEC for database APIs and uses Structured Query Language (SQL) as its database access language.

SALT (W3C). Speech Application Language Tags
is a platform-independent standard that makes possible multimodal and telephony-enabled access to information, applications and Web services from PCs, telephones, tablet PCs and wireless PDAs (personal digital assistants). The standard extends existing markup languages such as HTML, XHTML and XML.

SIP, RTP, MGCP (IETF). SIP (Session Initiation Protocol) is a signaling protocol for Internet conferencing, telephony, presence, events notification and instant messaging. RTP (Real-time Transport Protocol) is a protocol for the transport of real-time data, including audio and video. MGCP/MEGACO (Media Gateway Control Protocol) addresses the relationship between the media gateway, which converts circuit-switched voice to packet-based traffic, and the media gateway controller (sometimes called a softswitch), which dictates the service logic of that traffic.
SRGS (W3C). Speech Recognition Grammar Specification defines the syntax for grammar representation intended for use by speech recognizers and other grammar processors so that developers can specify the words and patterns of words to be listened for by a speech recognizer.

SSML (W3C). Speech Synthesis Markup Language is a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. Its essential role is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch and rate across different synthesis-capable platforms.

SS7/ISUP (IETF). Signaling System 7 is an architecture for performing out-of-band signaling in support of the call-establishment, billing, routing and information-exchange functions of the PSTN (public switched telephone network). It identifies functions to be performed by a signaling-system network and a protocol to enable their performance. ISUP
(ISDN User Part) defines the messages and protocol used in the establishment and teardown of voice and data calls over the PSTN, and to manage the trunk network on which they rely.

VoiceXML (W3C). VoiceXML (Voice eXtensible Markup Language) is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF (dual-tone multifrequency) key input, recording of spoken input, telephony and mixed-initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

WAP / WML (OMA). Wireless Application Protocol and Wireless Markup Language refer to a markup language based on XML that is intended for use in specifying content and user interface for narrowband devices, including cellular phones and pagers.

XHTML (W3C). Extensible HyperText Markup Language
is a family of current and future document types and modules that reproduce, subset and extend HTML 4. The XHTML document types are XML-based and ultimately are designed to work in conjunction with XML-based user agents.

XML (W3C). eXtensible Markup Language is a simple, very flexible text format derived from SGML (Standard Generalized Markup Language). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

X+V (W3C). XHTML + Voice brings spoken interaction to standard Web content by integrating a set of mature Web technologies such as XHTML and XML Events with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, speech grammars and the ability to attach voice event handlers.

Application Creation

It is clear that if an application developer were required to attend specifically to the details of every standard during the development process, application creation would become prohibitively complex. Yet it is equally clear that the evolution of standards plays a vital role in facilitating inter-vendor operability and modularization, and has become the lifeblood of growth in our industry. The solution to this problem is found in today's collection of development tool suites and service creation environments.
In recent years, these have grown in sophistication to the point that the developer is shielded from the intricacies of standards conformity or enforcement, yet can derive the full benefit of standards conformance. The retail shelves are lined with a collection of sophisticated tool suites and SCEs (Service Creation Environments) designed to address these problems. Developers produce speech and multimodal applications using a selected subset of these tools. Figure 2 illustrates how one can combine a carefully selected subset of these tools and packaged application delivery components to shield the developer from the need to explicitly master each of the standards inherent in the architecture.

Application Development And Deployment Using The "Right" Tools

The following description summarizes our application development process with special emphasis on the tools and delivery components used in each phase, and how standards are addressed without the need for specific focus on each.

Planning and discovery. The development process starts with "planning and discovery," where the project management team interviews the customer and analyzes the problem in detail to identify the application's purpose and methodology.

Dialog design and evaluation. A user interface layout tool is used to express the application's methodology as an ordered collection of dialog "states," where each state includes a prompt, expected responses to the prompt and lists of actions associated with each response. The same tool manages testing, in which the application's execution is simulated for candidate end users using operator-guided call flow.

Grammar and prompt design. Once dialog design is completed,
evaluated and modified as required, the detailed grammars are entered using a feature that employs a spreadsheet metaphor to refine the responses in each dialog state by entering anticipated words and phrases. Additionally, a prompt design tool is used to structure the verbal output for each dialog state to use any mixture of recordings and synthesized speech.

Business transaction integration. Integration with the business transaction process is performed by building a connector. The tool supports creation of connectors to databases, legacy mainframe applications, and to any Web-based resource or site. Output from a connector consists of an XML or XHTML stream which is integrated into the code using J2EE conventions.

Voice gateway integration. A voice gateway is selected which best meets a customer's needs, and the runtime engine is conditioned to produce the markup language stream (VoiceXML or SALT) appropriate to the selected platform. The voice gateway provider's tools are used to integrate the gateway into customer-specific circuit-switched or packet-switched networks.

Application testing, tuning and delivery. Tuning and testing are performed using a combination of locally developed tools and tools provided by the speech recognizer and voice gateway vendors. Tuning is followed by piloting, beta testing and phased rollout as dictated by customer contracts.

The past two or three years have seen outstanding growth in the speech application industry, and the start of an expansion into the adjacent multimodal application industry. No single factor is more important in stimulating this growth than the creation of cross-vendor and cross-industry standards.
Yet the very abundance of new and maturing standards has led to an increased incentive to hide their arcane complexity in tools that facilitate service creation without the requirement to master details of each relevant standard. Fortunately, service creation tools and deployment platforms have also matured significantly and, because of the very standards they encapsulate, interoperate to permit cross-vendor life cycle support for speech and multimodal applications. It is the growth of standards that makes this blossoming interoperability possible and provides the foundation for our industry to grow to maturity.

At Unisys (www.unisys.com), Dr. Scholz managed the development of two large scale expert systems. Starting in 1991 as R&D manager, he managed business development for government service contracts. In 1994 he co-founded the NL Speech Solutions business unit and since then has been directing efforts to integrate speech recognition and natural language processing in the creation of Spoken Language Understanding systems. He is a frequent speaker at professional trade shows and was selected as one of the Top Ten Leaders in speech by Speech Technology magazine in 2001. His commitment to standards is demonstrated by his participation as the Unisys representative to the SALT Forum, the VoiceXML Forum, and the W3C Voice Browser Working Group.