Speaking to an understanding SPHINX

Until now, the award for best performance by a computer able to understand spoken words has gone to systems trained to recognize a particular individual's voice. The SPHINX system, recently developed by graduate student Kai-Fu Lee of Carnegie-Mellon University (CMU) in Pittsburgh, matches that level of performance -- but with a significant difference. It responds to just about any voice. Users don't have to endure the lengthy preliminary process of providing speech samples to ensure the computer can understand their words.

Lee describes SPHINX as the world's first "accurate, large-vocabulary, speaker-independent" speech recognition system.

Researchers have long felt that speaker-independent systems can't work as well as speaker-dependent ones, says CMU's D. Raj Reddy, who more than a decade ago pioneered the HEARSAY speech program. "For the first time, we seem to have at least one [speaker-independent] system doing about the same as or better than a speaker-dependent system."

"Lee's achievement is definitely noteworthy," says James R. Baker of Dragon Systems, Inc., in Newton, Mass., a company specializing in speech-processing technology. "There's no question he has singlehandedly produced a state-of-the-art speech recognition system."

Computer-based speech recognition is already well established. Systems that recognize single words or simple phrases from a limited vocabulary are used in factory settings for controlling machinery, checking inventory, entering data and inspecting parts on an assembly line. In some hospitals, to keep their hands free for working with critically ill patients, nurses wear microphones so that they can describe their actions and observations to a computer that logs the information and keeps the necessary records.

Lee's system, with a 997-word vocabulary, is designed so Pentagon planners, simply by asking the appropriate questions, can speedily search for and find specific information stored in a database.

In the SPHINX system, sound waves captured by a microphone are first converted into strings of digits. Further processing reduces the number of digits necessary to represent the information contained in a waveform so that a single number characterizes 10 milliseconds of speech. Doing the signal processing in three different ways produces three such numbers for each speech segment. The system is programmed to look for patterns among these sets of numbers, and in a short time, it produces its best guess as to what the spoken words were.

The SPHINX system's superior performance can be attributed largely to a sophisticated computer program combining a powerful mathematical technique, known as hidden Markov modeling, with carefully formulated principles derived from human knowledge about speech, Reddy says. Furthermore, by analyzing large collections of speech samples from many different people, SPHINX can automatically fine-tune its capabilities.

"The [learning] technique we're using happens to be fairly sensitive to the amount of training," says Lee. "The more you train it, the better it gets." About 10 hours of training on 4,200 sentences spoken by different people is enough to let almost anyone thereafter use the system with good results.

Lee's achievement raises the intriguing question of whether the techniques he used would also enhance the performance of speaker-dependent systems. "We may find that if other people use all of the new techniques that Kai-Fu Lee introduced in his system, speaker-dependent systems might do better too," says Reddy. That possibility has yet to be tested.

Another possibility is a system that combines the best qualities of a speaker-independent system like SPHINX with the learning capabilities of a speaker-dependent system. Such a hybrid system would have a considerable store of knowledge about the human voice and the capacity for automatically adapting to a new speaker. After only a brief period of training, the user would be ready to use the system.

"A generalized model will never match your voice as well as a model derived from your own voice," says Janet M. Baker, Dragon Systems president. "As soon as you get specific information about a given speaker's voice, you want to make use of that. That's how you'll get the best performance."

Lee himself has studied the effect of adding procedures to SPHINX allowing the system to tailor its operation to a particular speaker. However, he found that with his methods the improvement in performance was disappointingly small. He attributes the small gain to the fact that the speaker-independent version of SPHINX is already quite accurate.

Lee wants to improve his SPHINX system so that it can handle a larger vocabulary and more complex relationships among words. So far, the system has dealt with only simple grammars. He's also considering alternative ways of adding human knowledge about words and sounds to his system.

"At this point, it's clear that one can get speech recognition at a respectable level of performance without requiring any training for individual speakers, and that's very encouraging," says Reddy. "The challenge is to get human-like performance. We're still an order of magnitude away from that."
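For readers curious about what "looking for patterns" with a hidden Markov model can involve, the following is a minimal sketch, not a description of SPHINX itself. The article gives no details of the program's internals, so every state name, probability and code value here is invented for illustration. The sketch also uses one code per 10-millisecond frame rather than the three the article describes. It applies the standard Viterbi algorithm to find the most likely sequence of hidden states behind a short run of observed codes.

```python
# Toy discrete hidden Markov model decoded with the Viterbi algorithm.
# All states, codes and probabilities are invented for illustration;
# none of them come from SPHINX.
import math

STATES = ["s0", "s1"]                      # hypothetical hidden states
START = {"s0": 0.8, "s1": 0.2}             # P(first state)
TRANS = {                                  # P(next state | current state)
    "s0": {"s0": 0.6, "s1": 0.4},
    "s1": {"s0": 0.1, "s1": 0.9},
}
EMIT = {                                   # P(observed code | state)
    "s0": {0: 0.7, 1: 0.2, 2: 0.1},
    "s1": {0: 0.1, 1: 0.3, 2: 0.6},
}


def viterbi(observations):
    """Return (best_state_path, log_probability) for a sequence of codes."""
    # best[s] holds the log-probability of the best path ending in state s.
    best = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        step, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor that makes arriving in s most likely.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            step[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            pointers[s] = prev
        best = step
        backpointers.append(pointers)
    # Trace back from the most likely final state.
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# One code per 10-millisecond frame (a simplification of the article's three).
frames = [0, 0, 2, 2]
path, log_prob = viterbi(frames)
print(path, math.exp(log_prob))
```

In a real recognizer the probability tables would be estimated from recorded speech instead of written by hand, which is consistent with Lee's remark that more training yields better results.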