
Penguins Like Salmon, Too

Springtime Brings New Community Events

  • May 10, 2004
  • By Dee-Ann LeBlanc

In a stainless steel-filled room designed for training cooks, Elaine Tsiang began her discussion by putting speech recognition in context. One of the attendees asked her what the point was for this technology, as he didn't really see much of a need for it.

Her answer was that speech recognition is valuable not only to those unable to type for physical or other reasons. It also serves fields where typing skills matter far less than the ability to report data for sharing among specialists; doctors, surgeons, and radiologists, for example, typically dictate reports that must be sent to professional transcription services to be typed up. And it helps speakers of non-alphabetic languages such as Chinese, which have so many characters that it is virtually impossible to build a simple keyboard to handle them.

She then went into the fundamentals of sound and speech recognition technologies. Often, in the world of computers, we're interested in the most we can squeeze out of whatever we're doing. When it comes to creating and editing sound, we go for the highest sampling rates possible in order to produce output that's excitingly clear.

Yet, when it comes to speech, you actually want to rein in your sound system and work below its capabilities. When it comes to speech recognition, we're only interested in around one percent of the entire signal; the rest is background noise and other distractions.
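To make the "work below its capabilities" point concrete, here is a small sketch (my own illustration, not from the talk) comparing the raw data rates of CD-quality audio with the much lower sampling rates that speech systems typically get by on. The specific rates are common conventions, not figures Tsiang cited:

```python
# Raw PCM data rate for a given sampling configuration.
# CD audio is conventionally 44.1 kHz stereo; telephone-band speech
# is often captured at just 8 kHz mono, and many recognizers use 16 kHz.

def data_rate_bytes_per_sec(sample_rate_hz, bits_per_sample=16, channels=1):
    """Bytes per second of uncompressed audio at this configuration."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

cd_quality = data_rate_bytes_per_sec(44100, channels=2)  # CD audio
speech = data_rate_bytes_per_sec(8000)                   # telephone-band speech

print(cd_quality)  # 176400 bytes/sec
print(speech)      # 16000 bytes/sec
```

Even at these reduced rates, most of the captured signal is still noise rather than the speech content the recognizer cares about.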

The hard part is pulling out the correct one percent. Our brains are designed to do this automatically, and Tsiang is convinced that the incredibly complex arrangement that processes human hearing is used in large part simply for the purpose of filtering out the good from the bad.

Considering the number of children or adults with auditory processing problems that make it difficult for them to ignore a plane flying overhead while trying to listen to a conversation, or other such distractions that many of us might not even notice, she may very well be right.

Rather than trying to analyze every tiny sound a computer finds in a recording, today's speech recognition works in reverse, matching what it picks up against databases of sounds and words. These specialized libraries currently cost twenty thousand dollars and more.

Tsiang feels that this is an area that open source can handily address, helping to assemble freely available term libraries along with the many specialized bits and pieces of code that have to handle the various problems and functions in any fully featured speech recognition solution.

In order to figure out which word is being used, modern speech recognition models work with probabilities. Hidden Markov Models (HMMs) are applied over very short spans of time to guess, after each sound, what is likely to follow. So if someone starts by saying "ah," the software begins searching its databases for every word that begins with this sound, building a list of words that begin with that particular sound and the words likely to follow them.
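The candidate-list idea above can be sketched in a toy form. This is my own illustration, not code from the talk, and the probabilities and tiny lexicon are invented for demonstration; a real HMM recognizer would work over phoneme sequences with trained transition and emission probabilities:

```python
# Toy sketch of HMM-style candidate ranking: after hearing a sound, keep a
# list of candidate words weighted by how likely each continuation sound is.

# Hypothetical transition probabilities: P(next sound | current sound).
transitions = {
    "ah": {"p": 0.5, "t": 0.3, "m": 0.2},  # sounds likely to follow "ah"
}

# Hypothetical lexicon mapping (current sound, next sound) to words.
lexicon = {
    ("ah", "p"): ["apple", "apt"],
    ("ah", "t"): ["at", "atom"],
    ("ah", "m"): ["am"],
}

def candidates_after(sound):
    """Rank candidate words by how probable their next sound is."""
    ranked = []
    for nxt, prob in sorted(transitions[sound].items(), key=lambda kv: -kv[1]):
        for word in lexicon.get((sound, nxt), []):
            ranked.append((word, prob))
    return ranked

print(candidates_after("ah"))
# [('apple', 0.5), ('apt', 0.5), ('at', 0.3), ('atom', 0.3), ('am', 0.2)]
```

As each new sound arrives, the recognizer would reweight and prune this list, which is why the search can stay tractable even with large vocabularies.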

Sentences, grammar, and other such broader issues are completely ignored, as the software merely tries to figure out what is being said. The libraries Tsiang was referring to are called "seed models," as they're the base that each speech recognition program starts with before training. Seed models are jealously guarded as trade secrets by proprietary vendors.

As for the solutions available today, many of the programs that existed in 2002, such as IBM's ViaVoice and the components that depended on it (xvoice, gvoice, and kvoice), are abandoned. Interest in this technology appears to be picking up again, however, with Sphinx being reborn at Carnegie Mellon University and the Open Source Speech Recognition Initiative developing tools such as windictator, which currently pairs a Linux server with a Windows client. Hopefully we'll have more options under Linux soon.
