Speech technologies for the 21st century. (Call Center/CRM Management Scope).

Almost every vision of technology in the 21st century includes interacting with devices via speech. A computer might be asked to search a database, display a star chart or activate machinery. Just by speaking a name, a caller can establish communications no matter where the other person may be. In books, film and television, we constantly see speech as the most common way to interact with our artificial environment. While flying cars and orbiting hotels remain a distant dream, practical speech technology for input and output is a reality today. In fact, speaking to devices and hearing their spoken responses is rapidly becoming commonplace. The reason is simple: recent advances in speech technology have enabled new applications that offer dramatic return on investment. Today, we can place a telephone call to check flight arrival times, track packages, transfer bank funds or purchase office supplies without ever speaking to a human agent. Despite the relatively low cost of such systems, they are designed to deliver high customer satisfaction and do so all day, every day. Speech technology is changing the way we conduct business over the telephone, and soon will be changing the way we control devices and access information no matter where we may be. Two separate but related speech technologies are responsible for this revolution. Each has been available commercially for decades, but only through remarkable progress in the past few years have they matured sufficiently for mainstream applications. Because speech is so natural to us, it may seem that it would be easy for a computer to manage.
In fact, speech is remarkably complex in subtle and surprising ways that have been discerned only through clever experimentation and analysis. Humans are adept at speech because, through millions of years of evolution, we have developed brain, vocal tract and auditory specializations that enhance our ability to listen and speak. In a much shorter time we have been able to invent similar mechanisms so that computers can now also listen and speak.

Speech recognition is the technology that enables computers to hear by determining which words were spoken. It starts by capturing an audio signal and processing it through sophisticated algorithms that mimic some of the processing performed by the human ear. The sounds are evaluated to determine which phonemes, the basic constituents of speech, they might represent. The possible strings of phonemes are then compared against a grammar of allowable words and phrases to determine what was most likely said, resulting in a textual representation of what was spoken. Based on these words, the computer can conduct further actions such as placing calls or buying stocks. Older speech recognition technology required greatly constraining
the task to simplify the computation needed, perhaps by understanding only a single speaker's voice, requiring pauses between words or limiting the vocabulary to a handful of carefully selected phrases. However, today's commercial speech recognition products are far less restrictive. They understand almost everyone's speech no matter how strong the accent, allow callers to speak naturally and spontaneously, reject background noise and line distortions, and support virtually unlimited vocabularies. These properties allow speech recognition to handle tasks that previously could be assigned only to a human agent, such as entering names or addresses, while delivering accuracy that rivals human listening performance.

Telephony applications, with or without speech recognition, often respond to callers by playing recorded speech. This works well when the possible responses can be enumerated in advance and are relatively few in number, but it cannot be used to read an email message or provide an address listing. Speech synthesis, also called "text-to-speech" or "TTS", is the technology that enables computers to speak arbitrary phrases.
It starts by analyzing the text to be spoken, converting strings such as "$3.50" into "three dollars and fifty cents" and determining how each word is pronounced. This conversion needs to be sophisticated enough to know when the abbreviation "Dr." is pronounced as "doctor" and when it is pronounced "drive," or when "read" is pronounced like "red" or "reed." Appropriate pitch, timing and emphasis must be assigned to words in each sentence to avoid producing a grating monotone. Only then can an audio stream be generated. You may have heard phone numbers read to you by concatenating recordings of each digit. Today's leading speech synthesis technique takes a similar approach but on a much finer scale, dissecting recorded speech into tiny stretches and reassembling them to form the requested phrases. The output is so natural sounding that it can be difficult to distinguish from an original recording, a startling contrast to earlier approaches that were highly intelligible but had a mechanical quality. The synthetic output sounds so much like the original speaker's voice that it is possible to smoothly blend natural and synthetic speech into a single, spoken response without noticeable transitions.

Bringing together high-performance speech recognition and natural-sounding speech synthesis allows developers to create applications that engage callers with dialog. Doing so effectively is not easy, as designers must draw on their experience to combine art and science into a plan covering each step in the conversation. Should prompts be friendly or formal? How much guidance should be offered and when? What if the caller makes a mistake? Even brief exchanges may harbor hidden complexity that is best exposed through observing the behavior of actual callers, and subtle changes can result in dramatic differences in usability. Through careful wording of prompts and anticipation of possible responses, designers can create an experience that allows almost every caller to complete tasks efficiently by answering a series of directed questions. The result is a much lower cost per call compared to live agents without the frustration induced by lengthy touch-tone menus. In fact, thanks to shorter hold queues and the ability of seasoned callers to interrupt prompts, speech-driven applications often provide the highest caller satisfaction possible.

Today's speech recognition and synthesis technologies depend on copious processing power and memory, usually available only in server systems installed within contact centers, and enabled by the plummeting cost of computation and storage. However, the same advances we have seen in server computing are beginning to appear in handheld devices, too.
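To make the text-analysis stage of speech synthesis described earlier more concrete, here is a minimal Python sketch of how a string such as "$3.50" or the abbreviation "Dr." might be expanded before any audio is generated. It is an illustration only, not how any commercial product works: the function names, the regular expressions and the capitalization heuristic for "Dr." are all assumptions made for this example, amounts are limited to under one thousand dollars, and homographs such as "read" are not handled at all.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 (hypothetical helper)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def _expand_currency(match: re.Match) -> str:
    """Turn a matched dollar amount into words."""
    dollars = int(match.group(1))
    cents = int(match.group(2) or 0)
    spoken = number_to_words(dollars) + (" dollar" if dollars == 1 else " dollars")
    if cents:
        spoken += " and " + number_to_words(cents) + (" cent" if cents == 1 else " cents")
    return spoken


def normalize(text: str) -> str:
    """Expand currency amounts and the abbreviation "Dr." into spoken words."""
    # Dollar amounts below 1000, with optional two-digit cents.
    text = re.sub(r"\$(\d{1,3})(?:\.(\d{2}))?(?!\d)", _expand_currency, text)
    # Assumed heuristic: "Dr." followed by a capitalized word is a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # Assumed heuristic: "Dr." right after a word (a street name) means "Drive".
    text = re.sub(r"(?<=[a-z])\s+Dr\.", " Drive", text)
    return text


if __name__ == "__main__":
    print(normalize("The fee is $3.50."))
    print(normalize("Dr. Smith lives at 12 Oak Dr."))
```

Even this toy version shows why the conversion has to be sophisticated: the two "Dr." rules can disagree (an address followed by a capitalized word would be misread as a title), which is exactly the kind of ambiguity a production system must resolve with far richer context.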
While size, weight and battery life restrictions dictate that handheld devices will never be as capable as their server brethren, compelling speech recognition and synthesis technology embedded in consumer devices is now possible. The convergence of several other technology trends is creating exciting new possibilities for applications on handheld devices. Color flat panel displays are becoming inexpensive and lightweight with dazzling image quality, making it possible to view photos, intricate diagrams and even video. Battery capacity is increasing while device power consumption is decreasing, reducing device weight while extending the time between charges. Wireless data networking is becoming pervasive for both long- and short-distance connections, enabling devices to access the Internet and work cooperatively. In addition, location tracking is becoming less expensive and more accurate, enabling a new category of location-based services.

The Natural Interface For Handheld Computers

Emerging handheld devices are powerful computing platforms that combine and transcend the capabilities of existing PDAs and mobile phones, delivering functions traditionally reserved for full-fledged computers. Yet, these devices are too small for practical keyboards, making extensive data entry complicated and error prone. How will consumers tap the tremendous power of these devices?
How will they access valuable information no matter where they are and what they are doing? With speech, of course. Speech provides an intuitive interface for these complex devices that allows users to concentrate on what needs to be done rather than on how it can be accomplished. The speech technologies used to control a device cannot be located in the network because accessing them would consume too much power, incur excessively long latencies and be highly unreliable and costly. Instead, the speech technologies must be embedded in the device itself, tapping network resources only when needed to handle the most complicated tasks. Using speech recognition to dial a phone number in your personal address book might be accomplished using embedded technology. Using speech recognition to search for a phone number in the city directory might be accomplished using network-based technology. Both embedded speech recognition and embedded speech synthesis are required to produce a true conversational interface.

Although speech can provide an intuitive interface, it alone is not a complete solution. Speech is poorly suited to data that demand privacy and could be overheard, such as passwords or PINs. Speech is also ill-advised in situations, such as business meetings, where courtesy demands silence. The true answer lies in multimodal interfaces, accepting text, pointing and speech as input and producing text, graphics and speech as output. Together these methods create a universal interface that can provide information to anyone, anytime, anywhere, even when driving an automobile.
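The split between embedded and network-based recognition described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real product design: it operates on text that has already been recognized rather than on audio, the address book entries are made up, and the names `embedded_grammar`, `route` and `Result` are hypothetical. The idea it shows is that the device holds a small grammar of allowable phrases built from the personal address book, and anything outside that grammar is handed to the network.

```python
from dataclasses import dataclass

# Hypothetical personal address book stored on the device.
ADDRESS_BOOK = {"alice": "555-0100", "bob": "555-0101"}


@dataclass
class Result:
    handled_by: str  # "embedded" or "network"
    action: str      # description of what would happen next


def embedded_grammar(address_book: dict) -> dict:
    """Enumerate the small set of phrases the on-device recognizer allows."""
    return {f"call {name}": number for name, number in address_book.items()}


def route(utterance: str, address_book: dict) -> Result:
    """Handle the request on the device if the grammar covers it, else defer."""
    # Normalize case and whitespace so "Call  Alice" matches "call alice".
    phrase = " ".join(utterance.lower().split())
    grammar = embedded_grammar(address_book)
    if phrase in grammar:
        return Result("embedded", f"dial {grammar[phrase]}")
    # Larger tasks, such as a city directory search, go to the network.
    return Result("network", f"forward '{phrase}' to network directory search")


if __name__ == "__main__":
    print(route("Call Alice", ADDRESS_BOOK))
    print(route("find a pizza restaurant", ADDRESS_BOOK))
```

Keeping the on-device grammar small is what makes the embedded path cheap in power and latency; the network is consulted only for requests the device cannot resolve on its own.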
The coming generation of handheld devices will be produced by many manufacturers and will offer capabilities to access information and applications from a diverse array of sources. It must be possible for a device to load any content the user might need, regardless of who created that content. Similarly, a content provider will want to ensure its wares are available on any device, regardless of the manufacturer. The only way such an ecosystem can develop is through agreed-upon standards for representing data and controlling the device's multimodal interface capabilities. This will allow a map vendor to specify the action taken when a user points to a particular location or says "zoom out." Thanks to the efforts of Web developers, excellent standards exist today for text, graphics and other media distributed to millions of desktop browsers. That addresses most of what is needed for tomorrow's multimodal applications, but a common means for controlling the spoken interface is missing. The leading approach to fill this void, known as the SALT (Speech Application Language Tags) specification, is currently under development by an industrywide consortium. The SALT specification builds on the strong base of Web standards, harmoniously adding just what is necessary to control speech input and output.
This approach works well with existing Web development tools, making it easier to voice-enable Web content and thereby accelerating adoption of multimodal applications. Within a few years, standards-based multimodal interfaces will make it possible to retrieve a map of your current location, find a nearby restaurant, read a review of its menu and call to reserve a table, all with a few spoken commands uttered into a single device that fits comfortably in your hand.

The complex artificial intelligence behind HAL, as envisioned by Arthur C. Clarke in 2001: A Space Odyssey, is still only science fiction. Yet artificial spoken communication is available today and offers a compelling business case to drive its adoption. It will not be long before we each place a phone call at least once per day that is answered by a computer, and soon we will all carry a device that allows us to effortlessly tap the vast information resources of the Internet. Whereas today's children expect television to provide a color image and stereo sound, tomorrow's children will find it difficult to imagine technology incapable of conversation.

For information and subscriptions visit www.TMCnet.com or call 203-852-6800.

Rob Kassel is product marketing manager of Emerging Technologies for SpeechWorks International, Inc. (www.speechworks.com)