Vowels: a voice-based Web engine for locating speeches.
The most common interaction with the Web is through the ubiquitous browser using visual interfaces. Most search engines use traditional visual interfaces to interact with their users. However, for certain applications and users it is desirable to have other modes of interaction with the Web. Current interfaces limit the mobility of the user and his/her interaction with the Web because both hands and eyes must be involved in the task. Speech technology will promote an increased use of the Web in untapped environments in a similar way that cell telephones have promoted the increased use of telephones. One limitation on the development of voice systems has been the lack of standards for creating spoken dialogue systems. A promising emerging technology for solving this problem is VoiceXML. VoiceXML brings the Web and content delivery together in voice response applications in an easy-to-use manner. This article presents a voice-based search engine implemented in VoiceXML and describes the design and development issues behind the application.
Currently, the most common interaction with the Web is visual and accomplished through the use of the keyboard or mouse. While sound files can be incorporated as part of the presentation, the user rarely interacts with a web page using speech. This orientation limits the mobility of the user and his/her interaction with the Web because both hands and eyes must be involved in a given task. In fact, most pervasive computing devices today are used in a hands-and-eyes-busy mode. The use of speech recognition and synthesis will remove this limitation and promises to provide more flexibility in the design and development of web interfaces (Hartman & Vila, 2001).
The use of speech technology frees the user from the windows, icons, menus, and pointers (WIMP) interface (Boyce, 2000; Lucas, 2000; Shneiderman, 2000). This technology enables users to interact with the Web using voice commands. Therefore, users who do not have an Internet-connected computer on hand can use a telephone instead to interact with the Web. Furthermore, this mode of interaction makes the Web more accessible for certain types of users (e.g., the visually impaired). Speech technology will promote an increased use of the Web in, as yet, untapped environments in a similar way that cell telephones have promoted the increased use of telephones. Many of the speech interfaces today are similar to telephone response systems in which the user is expected to enter a preset number from a menu of choices. While these systems are commonplace today, they are viewed as limited because the user must remember the mapping to keys, there may not be an appropriate option, or navigation must proceed through a prescribed set of options, even if the users know exactly what they want to do (Boyce, 2000).
A literature review reveals that researchers are working to understand speech interaction and the conditions under which its use is most successful. Several articles have reported on studies of the voice interaction of humans with computers and how to make speech interfaces more conversational and adaptive than voice response systems (Boyce, 2000; Lai, 2000; Lucente, 2000). Boyce investigated the use of keyword-driven and spoken dialogue systems and found that spoken dialogue systems were more flexible, but more complex to design. She also found that users preferred to interact with systems that referred to themselves as "I," but found no significant differences in the preference for casual versus formal speaking style. Furthermore, the research indicated that the right initial system greeting is essential for establishing user expectations and helping users determine how to proceed (Boyce, 2000).
Shneiderman investigated the limitations of current speech interfaces, particularly the interaction of speech and physical activity (e.g., keyboard manipulation) in interfaces. He found that most humans find it easier to type and think concurrently than to speak and think concurrently. Thus, voice command users needed to review their work more often than keyboard users in a word processing environment. He concluded that an understanding of the cognitive processes used in speech will aid interface designers in integrating speech in a more effective manner. Shneiderman also indicated that future uses of speech in web environments will not be as standalone components, but as complements to visual interfaces as part of a multimodal interface (Shneiderman, 2000).
Multimodal interaction, which includes speech, is part of a paradigm shift away from the use of WIMP interfaces. These systems have the potential for functioning more robustly than a single recognition-based technology such as speech. The design of these systems requires knowledge of the properties of the individual modes, such as speech, and the information content that accompanies them (Oviatt, 1999).
Recognizing the potential of speech interfaces and the opportunities that they offer, a Voice-Based Web Engine for Locating Speeches (VOWELS) was developed as an alternative method to retrieve speeches from a media repository. This application allows users to search speeches using voice commands and dialogs over the telephone. The search space of VOWELS was restricted to speeches of prominent figures of the twentieth century. For example, the speech repository includes speeches such as "I have a dream ..." by Dr. King, "A date which will live in infamy" by President Franklin D. Roosevelt, "... ask not what your country can do for you; ask what you can do for your country ..." by President John F. Kennedy, and "That's one small step for man; one giant leap for mankind ..." by astronaut Neil Armstrong. VOWELS was implemented using VoiceXML, one of the most promising emerging technologies for implementing voice web applications.
The remainder of the article is organized as follows: (a) a description of VoiceXML as one of the emerging technologies; (b) the architecture of VOWELS; (c) two scenarios of the use of VOWELS; and finally, (d) the summary and conclusions.
VOICE EXTENSIBLE MARKUP LANGUAGE (VOICEXML)
Voice eXtensible Markup Language (VoiceXML) is a language developed and promoted by the VoiceXML Forum, founded in 1999 by AT&T, IBM, Lucent, and Motorola (VoiceXML Forum, 2003). The VoiceXML 1.0 Specification was completed in March of 2000 and accepted by the World Wide Web Consortium (W3C) in May of 2000 (World Wide Web Consortium, 2003). The Forum now has over 370 member companies that are actively involved in using and promoting this new technology. VoiceXML is a W3C Recommendation that allows a web server to deliver voice dialogues to users over the phone by way of a voice server. It provides a bridge between computer telephony technology and web server technology. This bridge makes the Voice Web possible.
VoiceXML is an eXtensible Markup Language (XML) vocabulary that brings the Web, content delivery, and voice response applications together in an easy-to-use manner. XML is a specification for designing markup languages that are used on the Web and is an accepted standard for providing structure to web documents. It specifies a standardized text format for representing structured information on the Web. XML documents consist of data and markup components (e.g., element tags, processing instructions, data elements, and comments) that are parsed and interpreted by an XML processor. Synchronized Multimedia Integration Language (SMIL) and VoiceXML are just two of many XML vocabularies that have been developed.
A VoiceXML document defines a voice dialogue between a user and the system. The system can speak phrases or sentences to the user, either from prerecorded voice files or as output generated in real time using text-to-speech (TTS) synthesis. User input can be spoken words recognized by Automatic Speech Recognition (ASR) technologies. The VoiceXML dialogue specifies what action to take based on user input.
A VoiceXML dialogue may present a simple menu of choices for the user to select from, or it may be a more thorough type of interaction using forms. For example, to interact with VOWELS, a voice form might allow a user to fill in fields for the name of the person giving the speech and/or the location where the speech was given. Once a VoiceXML form has been filled out, the data is submitted to the web server in the same way data from an HTML form would be submitted. A server-side web application can then use the submitted data to perform any kind of transaction (e.g., query a database and return the results to the voice web browser as a new VoiceXML document).
[FIGURE 1 OMITTED]
A sample VoiceXML document is presented in Figure 1.
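Because the figure is not reproduced here, the following Python sketch builds a small VoiceXML form of the kind described above and prints it. It is an illustration only, not the document from Figure 1: the element names (vxml, form, field, prompt, block, submit) come from the VoiceXML specification, while the field names, the prompt wording, and the example.com submit URL are hypothetical. Grammars are omitted for brevity, so a real voice server would need one per field.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_search_form(submit_url: str) -> str:
    """Return a VoiceXML form that asks for a speaker and a place."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "speech_search"})

    # One <field> per item the caller must supply (hypothetical names).
    # Grammars are left out of this sketch.
    for name, question in (
        ("speaker", "Who gave the speech?"),
        ("place", "Where was the speech given?"),
    ):
        field = ET.SubElement(form, "field", {"name": name})
        ET.SubElement(field, "prompt").text = question

    # Once both fields are filled, submit them as an HTML form would.
    block = ET.SubElement(form, "block")
    ET.SubElement(
        block,
        "submit",
        {"next": submit_url, "namelist": "speaker place", "method": "post"},
    )
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical URL, used only to show the shape of the output.
    print(build_search_form("http://example.com/vowels/search"))
```

The fields precede the block in the form, so the voice browser collects both answers before it reaches the submit element.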
VOWELS: A VOICE-BASED WEB ENGINE FOR LOCATING SPEECHES
The application described in this article is a Voice-Based Web Engine for Locating Speeches (VOWELS). The search space is restricted to selected speeches of famous and infamous figures of the twentieth century. The application is driven by a telephone as a front-end interface. The user states his/her choice of search criteria and retrieves the matching speech from a voice repository. The motivation behind the development of VOWELS was twofold: to serve as a proof of concept and as a research tool to investigate the usability of voice web interfaces.
[FIGURE 2 OMITTED]
The VOWELS architecture comprises the following components:
* Touch-tone phone / End-User
* Voice Server (VoiceXML Gateway)
* Web Server
Touch-tone phone/end user. To use VOWELS, a user places a call, preferably from a touch-tone phone. Once the user is authenticated, a VOWELS search can be performed.
Voice server. The voice server used in this application is hosted by www.bevocal.com. Bevocal is a free voice server that supports VoiceXML. Users gain access to VOWELS through the voice server with a toll-free call. The interaction with the application is voice-based and takes the form of a dialog. VOWELS presents a set of alternatives to the user using text-to-speech synthesis, and the user responds using voice or a keypad. The words spoken by the user are recognized by the voice server using automated speech recognition technology. One of the major challenges in implementing VOWELS was designing an effective dialog (voice-based interface) and search criteria that would allow users to retrieve the desired speeches with ease and flexibility. Throughout the interaction with VOWELS, the voice server plays an intermediary role between the user and the web server. The voice server hosts the VoiceXML document for the initial dialog with the application. All of the other VoiceXML documents interpreted by the voice browser are dynamically generated by the VOWELS engine in the web server.
Web server. The web server used in this application is an IBM HTTP Server powered by Apache V1.3.12. The web server takes input from the voice server, connects to the database stored on it, fetches the speeches that match the user's search criteria, and returns the speeches in wave (.wav) file format to the voice server. The web server hosts VoiceXML and JSP pages.
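The last step of that exchange, turning matching speeches into a VoiceXML document for the voice server to play, can be sketched as follows. This is a minimal Python illustration rather than the JSP code used in VOWELS: the function name, the form id, and the wording of the no-results message are assumptions, and only the audio and prompt elements are taken from the VoiceXML specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def render_results(wav_urls):
    """Return a VoiceXML document that plays each .wav URL in order.

    If nothing matched, the document speaks a short message instead.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "results"})
    block = ET.SubElement(form, "block")

    if not wav_urls:
        # Hypothetical wording; rendered by text-to-speech.
        ET.SubElement(block, "prompt").text = "No speeches matched your search."

    for url in wav_urls:
        # Each recorded speech is played through an <audio> element.
        prompt = ET.SubElement(block, "prompt")
        ET.SubElement(prompt, "audio", {"src": url})

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file location, for illustration only.
    print(render_results(["http://example.com/speeches/king_dream.wav"]))
```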
VOWELS allows the retrieval of speeches given by famous and infamous figures of the twentieth century. The repository of speeches is limited but varied enough to allow users to exercise the retrieval of speeches based on a variety of search criteria. To start a session with VOWELS the user needs to place a call to Bevocal and be authenticated. The application then greets the user and presents a set of search criteria for the user to choose from. When the user finds the desired speech, VOWELS plays it. The user can listen to as many speeches as s/he wants in a session.
A user can find a speech based on the following search criteria:
* name of the person who gave the speech (e.g., "Martin Luther King");
* first letter of the last name of the person who gave the speech (e.g., "K");
* place where the speech was delivered (e.g., "Washington, D.C.");
* year in which the speech was given (e.g., "1963");
* topic of the speech (e.g., "Civil Rights"); and
* a few words from the speech (e.g., "I have a dream").
The scenarios for interacting with VOWELS using the above search criteria can be found in the section, "VOWELS: Usage Scenarios."
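The six criteria listed above can be combined in a single parameterized query. The sketch below uses Python's built-in sqlite3 module and an in-memory table purely for illustration; VOWELS itself stores its repository in Microsoft Access, and the table layout, column names, criterion keys, and sample rows here are all assumptions.

```python
import sqlite3

# Maps each search criterion (hypothetical key names) to a SQL condition
# and a function that prepares the user's value for that condition.
CONDITIONS = {
    "name": ("speaker LIKE ?", lambda v: f"%{v}%"),
    "initial": ("last_name LIKE ?", lambda v: f"{v}%"),
    "place": ("place = ?", str),
    "year": ("year = ?", int),
    "topic": ("topic = ?", str),
    "words": ("excerpt LIKE ?", lambda v: f"%{v}%"),
}


def find_speeches(conn, **criteria):
    """Return (speaker, topic, wav_url) rows matching every criterion given."""
    clauses, params = [], []
    for key, value in criteria.items():
        condition, prepare = CONDITIONS[key]  # KeyError on unknown criteria
        clauses.append(condition)
        params.append(prepare(value))
    where = " AND ".join(clauses) or "1 = 1"
    query = (
        "SELECT speaker, topic, wav_url FROM speeches "
        f"WHERE {where} ORDER BY year"
    )
    return conn.execute(query, params).fetchall()


def demo_repository():
    """Build a tiny in-memory repository with two illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE speeches (speaker TEXT, last_name TEXT, place TEXT, "
        "year INTEGER, topic TEXT, excerpt TEXT, wav_url TEXT)"
    )
    conn.executemany(
        "INSERT INTO speeches VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("Martin Luther King", "King", "Washington, D.C.", 1963,
             "Civil Rights", "I have a dream", "king_dream.wav"),
            ("John F. Kennedy", "Kennedy", "Washington, D.C.", 1961,
             "Inaugural Address", "ask not what your country can do for you",
             "jfk_inaugural.wav"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_repository()
    print(find_speeches(conn, words="I have a dream", name="Martin Luther"))
```

Because the user's values are passed as query parameters rather than spliced into the SQL text, a misrecognized utterance cannot alter the structure of the query.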
[FIGURE 3 OMITTED]
VOWELS Speech Repository
This is a web-based application used to maintain the speech repository. The Speech Repository is the collection of speeches given by famous and infamous figures of the twentieth century that can be retrieved using VOWELS. At this point only developers can access it and perform the following functions on the repository:
* add new speech records (Figure 3); and
* edit or delete existing speech records (Figure 4).
The main advantages of using this maintenance application are that it is web accessible and that, when the speech repository is updated, the integrity of the database is preserved through appropriate validation checks. Microsoft Access is the database system used to maintain the repository.
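The article does not list the validation checks themselves. As a hedged illustration of what such checks might look like, the Python sketch below validates a speech record before it is stored; the field names and the three rules (required fields, a twentieth-century year, and a .wav file name) are assumptions chosen to match the repository's stated scope, not the rules used in VOWELS.

```python
def validate_speech_record(record):
    """Return a list of problems; an empty list means the record passes."""
    errors = []

    # Hypothetical required text fields.
    for field in ("speaker", "place", "topic", "wav_file"):
        if not str(record.get(field) or "").strip():
            errors.append(f"{field} is required")

    # The repository is limited to twentieth-century speeches.
    year = record.get("year")
    if not isinstance(year, int) or not 1901 <= year <= 2000:
        errors.append("year must be an integer from 1901 to 2000")

    # Speeches are delivered to the voice server as wave files.
    wav_file = str(record.get("wav_file") or "").strip()
    if wav_file and not wav_file.lower().endswith(".wav"):
        errors.append("wav_file must end in .wav")

    return errors


if __name__ == "__main__":
    print(validate_speech_record({
        "speaker": "Martin Luther King",
        "place": "Washington, D.C.",
        "topic": "Civil Rights",
        "year": 1963,
        "wav_file": "king_dream.wav",
    }))
```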
[FIGURE 4 OMITTED]
VOWELS: USAGE SCENARIOS
A sample interaction with VOWELS is depicted in Figures 5 and 6. The call flow diagrams show a decision tree that is followed based on user input. The terminal node in the tree is the speech to be retrieved or a message indicating that the search yielded no results.
Two typical call-flow scenarios are described next.
In the chart shown in Figures 5 and 6, follow the dialogues:
1.1 - 1.2 - 1.3 - 1.4 - 1.5 - 1.6 - 1.7 - 1.8 - 1.9
The call-flow is defined for the option of giving the Name of the person who gave the speech and then the Place where the speech was given. In this example, the user names Kennedy and then gives the place:
Place: "Washington, D.C."
At this point VOWELS will inform the user about the speeches in the repository given by Kennedy in Washington, D.C. and their corresponding topics. Subsequently, the user can narrow the search by selecting one of the speeches.
In the chart shown in Figures 5 and 6, follow the dialogues:
2.1 - 2.2 - 2.3 - 2.4 - 2.5 - 2.6 - 2.7 - 2.8
The call-flow is defined for the option of giving a few Words from the speech and then the Name of the person who gave the speech. For example, the user says "I have a dream" as a few words from the speech and then "Martin Luther" as the name of the person who gave the speech.
Words: "I have a dream"
Name: "Martin Luther"
At this point VOWELS identifies the only speech that meets the search criteria and delivers the speech.
SUMMARY AND CONCLUSIONS
As stated in the W3C VoiceXML specification, VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications (W3C VoiceXML, 2003).
This project leverages the aforementioned characteristics of VoiceXML and prototypes the VOWELS system.
This application was tested using a convenience sample, and a number of observations can be made. Participants in the pilot test commented positively on VOWELS with regard to its ease of use and its minimal requirements for use (i.e., they only needed a telephone to interact with the application). Furthermore, they were successful in retrieving the speeches sought using the dialog and search criteria devised.
In addition to the benefits previously described, VOWELS is currently being investigated for use by history students. The ease of retrieving historical speeches by prominent figures using only the telephone can be beneficial to those who do not have any computing background or computing resources. Furthermore, students in information technology, computer science, and information systems can also use VOWELS to learn VoiceXML as a burgeoning technology in the computing field. The scope of VOWELS is of appropriate size for study, and the available documentation (including all the source code) makes it convenient for students to get started with VoiceXML. Also, VOWELS represents the type of application that lifelong learners in general could easily use with minimal background and resources.
Lastly, while the use of VOWELS can benefit many users, it should be noted that a percentage of the population might encounter problems interacting with VOWELS and with voice-based applications in general. According to Holly (2001), more than one-quarter of U.S. residents will experience significantly higher error rates with speech recognition technology. The 74 million people cited can be broken down as follows:
* 2 million non-native English speakers;
* 8 million women with high-pitched voices that recognition software cannot understand;
* 10 million people with accents, speech impediments, or voices that cannot be understood for unknown reasons; and
* 54 million children whose underdeveloped oral and nasal cavities produce sounds the software cannot recognize.
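The breakdown can be checked with a few lines of arithmetic. In the Python sketch below, the four group sizes are the figures cited above, while the U.S. population of roughly 285 million for 2001 is an outside approximation that does not appear in the source; it is used only to show that the total is consistent with "more than one-quarter."

```python
# Group sizes in millions, as cited from Holly (2001).
groups_millions = {
    "non-native English speakers": 2,
    "women with high-pitched voices": 8,
    "accents, impediments, or unknown reasons": 10,
    "children": 54,
}

# Assumption, not from the source: approximate 2001 U.S. population.
US_POPULATION_2001_MILLIONS = 285

total_millions = sum(groups_millions.values())
share = total_millions / US_POPULATION_2001_MILLIONS

if __name__ == "__main__":
    print(f"Total affected: {total_millions} million")
    print(f"Share of population: {share:.1%}")
```

The four groups sum to the 74 million cited, and under the assumed population that is a little over a quarter of U.S. residents.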
Step 1: Dial 1-877-33-BEVOCAL. (Connecting to the voice server)
Step 2: Dial in PIN number: 8370 (authenticating the user)
Step 3: Dial in User-ID: 4544313
Step 4: Follow the call-flow as shown in the diagram.
Step 5: Once you reach the last stage, say 'Back' to go back to the main menu to start the next call-flow scenario.
Boyce, S. J. (2000). Natural spoken dialogue systems for telephony. Communications of the ACM, 43(9).
Hartman, J., & Vila, J. (2001). VoiceXML builder: A tool for creating VoiceXML applications. Proceedings of WebNet 2001 World Conference on the WWW and Internet, Orlando, Florida, pp. 489-494. Norfolk, VA: Association for the Advancement of Computing in Education.
Holly, S. (2001). Speak up. PC Magazine, Ziff Davis Media. Retrieved November 1, 2001, from http://www.pcmag.com/article2/0,4149,26210,00.asp
Lai, J. (2000). Conversational interfaces--introduction. Communications of the ACM, 43(9).
Lucas, B. (2000). VoiceXML for web-based distributed conversational applications. Communications of the ACM, 43(9).
Lucente, M. (2000). Conversational interfaces for e-commerce applications. Communications of the ACM, 43(9).
Morrison, M. (2000). XML unleashed. Indianapolis, IN: SAMS.
Oviatt, S. (1999). Ten myths of multimodal interaction. Communications of the ACM, 42(1).
Shneiderman, B. (2000). The limits of speech recognition. Communications of the ACM, 43(9).
VoiceXML Forum. (n.d.). Retrieved September 1, 2003, from http://www.voicexml.org/
W3C VoiceXML. (n.d.). Retrieved September 1, 2003, from http://www.w3.org/TR/2001/WD-voicexml20-20011023/
World Wide Web Consortium (n.d.). Retrieved September 1, 2003, from http://www.w3.org/Voice
JOAQUIN VILA, BILLY LIM, AND ARCHANA ANAJPURE
Illinois State University
Publication: Journal of Educational Multimedia and Hypermedia
Date: Jun 22, 2004