Vowels: a voice-based Web engine for locating speeches.
The most common interaction with the Web is through the
ubiquitous browser using visual interfaces. Most search
engines use traditional visual interfaces to interact with their
users. However, for certain applications and users it is
desirable to have other modes of interaction with the Web.
Current interfaces limit the mobility of the user and his/her
interaction with the Web because both hands and eyes must be
involved in the task. Speech technology will promote an
increased use of the Web in untapped environments in a similar
way that cell telephones have promoted the increased use of
telephones. One of the limitations of the development of voice
systems was the lack of standards for creating spoken dialogue
systems. A promising emerging technology for solving this
problem is VoiceXML. VoiceXML brings the Web and content
delivery together in voice response applications in an
easy-to-use manner. This article presents a voice-based search
engine implemented in VoiceXML and narrates the design and
development issues behind the application.

Currently, the most common interaction with the Web is visual and accomplished through the use of the keyboard or mouse. While sound files can be incorporated as part of the presentation, the user rarely interacts with a web page using speech. This orientation limits the mobility of the user and his/her interaction with the Web because both hands and eyes must be involved in a given task. In fact, most pervasive computing devices today are used in a hands-and-eyes-busy mode. The use of speech recognition and synthesis will remove this limitation and promises to provide more flexibility in the design and development of web interfaces (Hartman & Vila, 2001). The use of speech technology frees the user from the windows, icons, menus, and pointers (WIMP) interface (Boyce, 2000; Lucas, 2000; Schneiderman, 2000). This technology enables users to interact with the Web using voice commands. Therefore, if users do not have a computer connected to the Internet on hand, they could use a telephone instead to interact with the Web. Furthermore, this mode of interaction makes the Web more accessible for certain types of users (e.g., the visually impaired). Speech technology will promote an increased use of the Web in, as yet, untapped environments in a similar way that cell telephones have promoted the increased use of telephones.
Many of the speech interfaces today are similar to telephone response systems in which the user is expected to enter a preset number from a menu of choices. While these systems are commonplace today, they are viewed as limited because the user must remember the mapping to keys, there may not be an appropriate option, or navigation must proceed through a prescribed set of options, even if the users know exactly what they want to do (Boyce, 2000). A literature review reveals that researchers are working to understand speech interaction and when its use is most successful. Several articles have reported on studies of the voice interaction of humans with computers and how to make speech interfaces more conversational and adaptive than voice response systems (Boyce, 2000; Lai, 2000; Lucente, 2000). Boyce investigated the use of keyword-driven and spoken dialogue systems and found that spoken dialogue systems were more flexible, but more complex to design. She also found that users preferred to interact with systems that referred to themselves as "I," but found no significant differences in the preference for casual versus formal speaking style. Furthermore, the research indicated that the right initial system greeting is essential for establishing user expectations and helping users determine how to proceed (Boyce, 2000).
Schneiderman investigated the limitations of current speech interfaces, particularly the interaction of speech and physical activity (e.g., keyboard manipulation) in interfaces. He found that most humans find it easier to type and think concurrently than to speak and think concurrently. Thus, voice command users needed to review their work more often than keyboard users in a word processing environment. He concluded that an understanding of the cognitive processes used in speech will aid interface designers in integrating speech in a more effective manner. Schneiderman also indicated that future uses of speech in web environments will not be as standalone components, but as complements to visual interfaces as part of a multimodal interface (Schneiderman, 2000). Multimodal interaction, which includes speech, is part of a paradigm shift away from the use of WIMP interfaces. These systems have the potential for functioning more robustly than a single recognition-based technology such as speech. The design of these systems requires knowledge of the properties of the individual modes, such as speech, and the information content that accompanies them (Oviatt, 1999).
Recognizing the potential of speech interfaces and the opportunities that they offer, a Voice-Based Web Engine for Locating Speeches (VOWELS) was developed as an alternative method to retrieve speeches from a media repository. This application allows users to search speeches using voice commands and dialogs over the telephone. The search space of VOWELS was restricted to speeches of prominent figures of the twentieth century. For example, the speech repository includes speeches such as "I have a dream ..." by Dr. King, "A date which will live in infamy" by President Franklin D. Roosevelt, "... ask not what your country can do for you; ask what you can do for your country ..." by President John F. Kennedy, and "That's one small step for man; one giant leap for mankind ..." by astronaut Neil Armstrong. VOWELS was implemented using VoiceXML, one of the most promising emerging technologies for implementing voice web applications. The remainder of the article is organized as follows: (a) a description of VoiceXML as one of the emerging technologies; (b) the architecture of VOWELS; (c) two scenarios of the use of VOWELS; and finally, (d) the summary and conclusions.

VOICE EXTENSIBLE MARKUP LANGUAGE (VOICEXML)

Voice eXtensible Markup Language (VoiceXML) is a language developed and promoted by the VoiceXML Forum, founded in 1999 by AT&T, IBM, Lucent, and Motorola (VoiceXML Forum, 2003). The VoiceXML 1.0 Specification was completed in March of 2000 and accepted by the World Wide Web Consortium (W3C) in May of 2000 (World Wide Web Consortium, 2003). The Forum now has over 370 member companies who are actively involved in using and promoting this new technology. VoiceXML is a W3C Recommendation that allows a web server to deliver voice dialogues to users over the phone by way of a voice server. It produces a bridge between computer telephony technology and web server technology. This bridge makes the Voice Web possible. VoiceXML is an application of the eXtensible Markup Language (XML) that brings the Web, content delivery, and voice response applications together in an easy-to-use manner.
XML (eXtensible Markup Language) is a specification for designing markup languages.
A VoiceXML document defines a voice dialogue between a user and the system. The system can speak phrases or sentences to the user, either from prerecorded voice files or as output generated in real time using text-to-speech (TTS) synthesis. User input can be spoken words recognized by Automatic Speech Recognition (ASR) technologies. The VoiceXML dialogue specifies what action to take based on user input. A VoiceXML dialogue may present a simple menu of choices for the user to select from, or it may be a more thorough type of interaction using forms. For example, to interact with VOWELS, a voice form might allow a user to fill in the fields for the name of the person giving the speech and/or the location where the speech was given. Once a VoiceXML form has been filled out, the data is submitted to the web server in the same way data from an HTML form would be submitted.
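The voice form described above can be sketched as follows. This is only an illustration, not the VOWELS source: it uses Python to emit a VoiceXML 2.0 style document, and the field names (`speaker`, `place`), the grammar files (`speakers.grxml`, `places.grxml`), and the submit target (`search.jsp`) are all hypothetical.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_search_form(submit_url: str) -> str:
    """Return a VoiceXML document with one form that asks for a speaker
    and a place, then submits both values to the web server."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "speech_search"})

    # Hypothetical fields: (variable name, spoken prompt, grammar file).
    fields = (
        ("speaker", "Who gave the speech?", "speakers.grxml"),
        ("place", "Where was the speech given?", "places.grxml"),
    )
    for name, question, grammar in fields:
        field = ET.SubElement(form, "field", {"name": name})
        ET.SubElement(field, "prompt").text = question
        ET.SubElement(
            field, "grammar", {"src": grammar, "type": "application/srgs+xml"}
        )

    # Form items are visited in document order, so this block runs once
    # both fields above have been filled, and sends their values onward.
    block = ET.SubElement(form, "block")
    ET.SubElement(block, "submit", {"next": submit_url, "namelist": "speaker place"})
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_search_form("search.jsp"))
```

The `namelist` attribute plays the role that named inputs play in an HTML form: it lists which field variables are sent with the request.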
Then a server-side web application can use the submitted data to perform any kind of transaction (e.g., query a database and return the results to the voice web browser as a new VoiceXML document).

[FIGURE 1 OMITTED]

A sample VoiceXML document is presented in Figure 1.

VOWELS: A VOICE-BASED WEB ENGINE FOR LOCATING SPEECHES

The application described in this article is a Voice-Based Web Engine for Locating Speeches (VOWELS). The search space is restricted to selected speeches of famous and infamous figures of the twentieth century. The application is driven by a telephone as a front-end interface. The user states his/her choice of search criteria and retrieves the matching speech from a voice repository. The motivation behind the development of VOWELS was twofold: as a proof of concept and as a research tool to investigate the usability of voice web interfaces.

[FIGURE 2 OMITTED]

VOWELS Architecture

The VOWELS architecture comprises the following components:

* Touch-tone phone / end user
* Voice server (VoiceXML gateway)
* Web server

Touch-tone phone/end user. To use VOWELS, a user has to place a call, preferably using a touch-tone phone. Once the user is authenticated, a VOWELS search can be performed.

Voice server. The voice server used in this application is hosted by www.bevocal.com. Bevocal is a free voice server that supports VoiceXML. Users gain access to VOWELS through the voice server with a toll-free call. The interaction with the application is voice-based and in the form of a dialog.
VOWELS presents a set of alternatives to the user using text-to-speech synthesis and the user responds using voice or a keypad. The words spoken by the user are recognized by the voice server using automated speech recognition technology. One of the major challenges in implementing VOWELS was designing an effective dialog (voice-based interface) and search criteria that would allow users to retrieve the desired speeches with ease and flexibility. Throughout the interaction with VOWELS, the voice server plays an intermediary role between the user and the web server. The voice server hosts the VoiceXML document for the initial dialog with the application. All of the other VoiceXML documents interpreted by the voice browser are dynamically generated by the VOWELS engine in the web server.

Web server. The web server used in this application is an IBM HTTP Server powered by Apache V1.3.12. The web server takes input from the voice server, connects to the database stored on it, fetches the speeches that match the user's search criteria, and returns the speeches in wave (.wav) file format to the voice server.
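The dynamic generation step described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the VOWELS engine (which the article says runs as JSP pages): the function name `render_results`, the `(speech_id, title, wav_url)` tuple layout, and the `play.jsp` target are hypothetical. It covers the three outcomes the article mentions: no result, exactly one speech to play, or several speeches to choose from.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def render_results(matches, play_url="play.jsp"):
    """Turn search results into a VoiceXML reply for the voice server.

    matches: list of (speech_id, title, wav_url) tuples (hypothetical layout).
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})

    if not matches:
        # Terminal node: tell the caller that nothing was found.
        block = ET.SubElement(ET.SubElement(vxml, "form"), "block")
        ET.SubElement(block, "prompt").text = "No speeches matched your search."
    elif len(matches) == 1:
        # Terminal node: play the recorded speech from its .wav file.
        _, title, wav_url = matches[0]
        block = ET.SubElement(ET.SubElement(vxml, "form"), "block")
        audio = ET.SubElement(ET.SubElement(block, "prompt"), "audio", {"src": wav_url})
        audio.text = title  # spoken instead if the audio file cannot be played
    else:
        # Several candidates: let the caller narrow the search by title.
        menu = ET.SubElement(vxml, "menu")
        ET.SubElement(menu, "prompt").text = (
            "Several speeches matched. Say the title of the one you want."
        )
        for speech_id, title, _ in matches:
            choice = ET.SubElement(menu, "choice", {"next": f"{play_url}?id={speech_id}"})
            choice.text = title

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(render_results([(7, "I Have a Dream", "speeches/7.wav")]))
```

Returning a complete document for each outcome keeps the voice server in its intermediary role: it only interprets whatever VoiceXML the web server sends back.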
The web server hosts VoiceXML and JSP pages.

VOWELS Interface

VOWELS allows the retrieval of speeches given by famous and infamous figures of the twentieth century. The repository of speeches is limited but varied enough to allow users to exercise the retrieval of speeches based on a variety of search criteria. To start a session with VOWELS the user needs to place a call to Bevocal and be authenticated. The application then greets the user and presents a set of search criteria for the user to choose from. When the user finds the desired speech, VOWELS plays it. The user can listen to as many speeches as s/he wants in a session.

A user can find a speech based on the following search criteria:

* name of the person who gave the speech (e.g., "Martin Luther King");
* first letter of the last name of the person who gave the speech (e.g., "K");
* place where the speech was delivered (e.g., "Washington, D.C.");
* year in which the speech was given (e.g., "1963");
* topic of the speech (e.g., "Civil Rights"); and
* a few words from the speech (e.g., "I have a dream").

The scenarios for interacting with VOWELS using the above search criteria can be found in the section, "VOWELS: Usage Scenarios."

[FIGURE 3 OMITTED]

VOWELS Speech Repository

A web-based application is used to maintain the speech repository. The Speech Repository is the collection of speeches given by famous and infamous figures of the twentieth century that can be retrieved using VOWELS. At this point only developers can access it and perform the following functions on the repository:

* add new speech records (Figure 3); and
* edit/delete existing speech records (Figure 4).
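To make the repository and the six search criteria concrete, here is a minimal in-memory sketch. It is not the VOWELS schema: the article only says the repository is kept in Microsoft Access, so the `Speech` record layout, the `search` function, the topic labels, and the file names below are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speech:
    """One repository record (hypothetical layout)."""
    speaker: str
    place: str
    year: int
    topic: str
    excerpt: str
    wav_file: str


# Sample records based on speeches named in the article; topics and
# file names are made up for illustration.
REPOSITORY = [
    Speech("Martin Luther King", "Washington, D.C.", 1963, "Civil Rights",
           "I have a dream", "king_1963.wav"),
    Speech("John F. Kennedy", "Washington, D.C.", 1961, "Inauguration",
           "ask not what your country can do for you", "kennedy_1961.wav"),
    Speech("Franklin D. Roosevelt", "Washington, D.C.", 1941, "World War II",
           "a date which will live in infamy", "roosevelt_1941.wav"),
]


def search(repository, *, name=None, initial=None, place=None,
           year=None, topic=None, words=None):
    """Return the speeches that satisfy every criterion that was given.

    Criteria left as None are ignored, so a caller can start with one
    criterion and narrow the result by adding another.
    """
    def matches(s):
        last_name = s.speaker.split()[-1]
        return (
            (name is None or name.lower() in s.speaker.lower())
            and (initial is None or last_name.lower().startswith(initial.lower()))
            and (place is None or place.lower() == s.place.lower())
            and (year is None or year == s.year)
            and (topic is None or topic.lower() == s.topic.lower())
            and (words is None or words.lower() in s.excerpt.lower())
        )

    return [s for s in repository if matches(s)]


if __name__ == "__main__":
    for speech in search(REPOSITORY, name="Kennedy", place="Washington, D.C."):
        print(speech.speaker, speech.year, speech.wav_file)
```

Treating every criterion as optional and combining them with "and" mirrors the way a caller gives one criterion first and then narrows the search with a second one.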
The main advantages of using this maintenance application are that it is web accessible and that, when the speech repository is updated, the integrity of the database is preserved through appropriate validation checks. Microsoft Access is the database system used to maintain the repository.

[FIGURE 4 OMITTED]

VOWELS: USAGE SCENARIOS

A sample interaction with VOWELS is depicted in Figures 5 and 6. The call flow diagrams show a decision tree that is followed based on user input. The terminal node in the tree is the speech to be retrieved or a message indicating that the search yielded no results. Two typical call-flow scenarios are described next.

Scenario 1: In the chart shown in Figures 5 and 6, follow the dialogues: 1.1 - 1.2 - 1.3 - 1.4 - 1.5 - 1.6 - 1.7 - 1.8 - 1.9

The call-flow is defined for the option of giving the Name of the person who gave the speech and then the Place where the speech was given.

Name: "Kennedy"
Place: "Washington, D.C."

At this point VOWELS will inform the user about the speeches in the repository given by Kennedy in Washington, D.C. and their corresponding topics. Subsequently, the user can narrow the search by selecting one of the speeches.

Scenario 2: In the chart shown in Figures 5 and 6, follow the dialogues: 2.1 - 2.2 - 2.3 - 2.4 - 2.5 - 2.6 - 2.7 - 2.8

The call-flow is defined for the option of giving a few Words from the speech and then the Name of the person who gave the speech. For example, the user says "I have a dream" as a few words of the speech and then "Martin Luther" as the name of the person who gave the speech.
Words: "I have a dream"
Name: "Martin Luther"

At this point VOWELS identifies the only speech that meets the search criteria and delivers the speech.

SUMMARY AND CONCLUSIONS

As stated in the W3C VoiceXML specification, VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF (Dual-Tone Multi-Frequency) key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications (W3C VoiceXML, 2003). This project leverages the aforementioned characteristics of VoiceXML and prototypes the VOWELS system. This application was tested using a convenience sample and a number of observations can be made. Participants in the pilot test commented positively on VOWELS with regard to its ease of use and its minimal requirements for use (i.e., they only needed a telephone to interact with the application). Furthermore, they were successful in retrieving the speeches sought using the dialog and search criteria devised. In addition to the benefits previously described, VOWELS is currently being investigated for use by history students. The ease of retrieving historical speeches by prominent figures using only the telephone can be beneficial to those who do not have any computing background or computing resources. Furthermore, students in information technology/computer science/information systems can also use VOWELS to learn VoiceXML as a burgeoning technology in the computing field. The scope of VOWELS is of appropriate size for study, and the available documentation (including all the source code) makes it convenient for students to get started with VoiceXML technology.
Also, VOWELS represents the type of application that lifelong learners in general could easily use with minimal background and resources. Lastly, while the use of VOWELS can benefit many users, it should be noted that a percentage of the population might encounter some problems interacting with VOWELS and voice-based applications in general. According to Holly (2001), more than one-quarter of U.S. residents will experience significantly higher error rates with speech recognition technology. The 74 million people cited can be broken down as follows:

* 2 million non-native English speakers;
* 8 million women with high-pitched voices that recognition software cannot understand;
* 10 million people with accents, speech impediments, or voices that cannot be understood for unknown reasons; and
* 54 million children whose underdeveloped oral and nasal cavities produce sounds the software cannot recognize.

Step 1: Dial 1-877-33-BEVOCAL. (Connecting to the voice
server)
Step 2: Dial in Pin number: 8370 (authenticating the
user)
Step 3: Dial in User-ID: 4544313
Step 4: Follow the call-flow as shown in the diagram.
Step 5: Once you reach the last stage, say 'Back' to go
back to the main menu to start the next
call-flow scenario.

References

Boyce, S. J. (2000). Natural spoken dialogue systems for telephony. Communications of the ACM, 43(9).
Hartman, J., & Vila, J. (2001). VoiceXML builder: A tool for creating voiceXML applications. Proceedings of WebNet 2001 World Conference on the WWW and Internet, Orlando, Florida, pp. 489-494. Norfolk, VA: Association for the Advancement of Computing in Education.
Holly, S. (2001). Speak up. PC Magazine, Ziff Davis Media. Retrieved November 1, 2001, from http://www.pcmag.com/article2/0,4149,26210,00.asp
Lai, J. (2000). Conversational interfaces--introduction. Communications of the ACM, 43(9).
Lucas, B. (2000). VoiceXML for web-based distributed conversational applications. Communications of the ACM, 43(9).
Lucente, M. (2000). Conversational interfaces for e-commerce applications. Communications of the ACM, 43(9).
Morrison, M. (2000). XML unleashed. Indianapolis, IN: SAMS.
Oviatt, S. (1999). Ten myths of multimodal interaction. Communications of the ACM, 42(1).
Schneiderman, B. (2000). The limits of speech recognition. Communications of the ACM, 43(9).
VoiceXML Forum. (n.d.). Retrieved September 1, 2003, from http://www.voicexml.org/
W3C VoiceXML (n.d.). Retrieved September 1, 2003, from http://www.w3.org/TR/2001/WD-voicexml20-20011023/
World Wide Web Consortium (n.d.). Retrieved September 1, 2003, from http://www.w3.org/Voice

JOAQUIN VILA, BILLY LIM, AND ARCHANA ANAJPURE
Illinois State University
USA
javila@ilstu.edu
bllim@ilstu.edu
aaanajp@ilstu.edu