Conversational Voice Recognition With Wizzard Software - page 2
The software does work on a consumer machine, but it is really designed to process offline audio files in a stout Linux server environment.
Wizzard provided me with a review copy to try on my Athlon 64 powered HP Pavilion notebook. The machine runs 64-bit SUSE Linux version 10.0. It has 1 GB of memory, an 80 GB disk, and speeds along at 2.0 GHz. With this setup I was able to process a few of my own audio files. It typically took about a minute to run through a 100-word paragraph.
Installation was straightforward and consisted of copying the file from the CD onto my hard drive. A quick untarring of the wizd_wizzscribe_si.tar file and I was all set to configure.
Next, under the data.input directory, I created a db file with the vi editor. I also modified the shared.py file in the english/cfg/ directory. The db file holds the utterance names, audio file name, and audio starting/stopping points. The shared.py file tells the program where to find audio files and where to put the results.
Initially, every time I ran the program I got a "root directory not found" error. The problem was eventually solved with the "-r" command line option.
Here's the command line that worked.
rreilly> ../bin/transcribe -m fast -r /home/rreilly/software/wizd/ rob30acc.wav
Note that the program was installed and executed by a normal user. Much of the application is written in Python.
I used Audacity and an inexpensive desktop microphone to create several audio clips of spoken text, one of which was the rob30acc.wav file. At first I captured my voice with the default 48 kHz sample rate and 32-bit sample format. Todd Kammerer, an engineer for Wizzard, suggested that I change the settings to 8 kHz for the sample rate and 16-bit for the sample format.
Not surprisingly, my early 48 kHz/32-bit audio files didn't work very well. There was no problem processing the files, but the recognition was poor.
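Since the sample rate and bit depth made such a difference, it can help to check a clip before feeding it to the recognizer. The following is a minimal sketch that uses only Python's standard wave module; the function names are my own and are not part of the Wizzard software. One caveat: the wave module reads integer PCM files only, so a 32-bit float clip is simply reported here as not matching.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_recommended(path, rate_hz=8000, bits=16):
    """True if the file uses the 8 kHz / 16-bit settings suggested in the text."""
    try:
        actual_rate, actual_bits, _channels = describe_wav(path)
    except wave.Error:
        # Raised for files the wave module cannot parse, such as float WAVs.
        return False
    return actual_rate == rate_hz and actual_bits == bits
```

Calling matches_recommended("rob30acc.wav") would then tell you whether a clip needs to be re-exported from Audacity before it is transcribed.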
Switching over to the recommended audio settings worked pretty well. With these settings I was able to get about 80% recognition on an 80- to 100-word paragraph. On the few files that I tried, I didn't notice any real difference in recognition between the "fast" and "accurate" mode options. "Fast" processed speech only slightly faster, on the order of a couple of seconds. Many factors come into play in speech recognition, so my experience certainly shouldn't be considered an exhaustive test of the program.
Here is a partial listing (some lines were removed for brevity) of the program output to the terminal screen, as it processed speech.
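For readers who want to put a number on recognition quality, one common approach is to compare the recognized text with what was actually said, word by word. The sketch below is my own illustration and not necessarily how the figures in this review were estimated. It counts the word-level edit distance (substitutions, insertions, and deletions) and converts it to an accuracy figure. The two sample sentences are made up for the example.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions."""
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word dropped
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """1 minus (edit distance / reference length); can go below zero."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference text is empty")
    return 1.0 - word_edit_distance(reference, hypothesis) / ref_len


# Made-up sentences: one wrong word out of five.
spoken = "the quick brown fox jumps"
recognized = "THE QUICK BROWN BOX JUMPS"
print(f"{word_accuracy(spoken, recognized):.0%}")  # prints 80%
```

Because inserted words count as errors too, a very noisy transcript can score below zero with this measure.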
-------------------------------------
ATTILA : V1.1 P004
Host : CURLYAMD
Date : Thu Aug 31 14:15:50 2006
Process ID : 9725
Arguments : ['../bin/transcribe', '-m', 'fast', '-r', '/home/rreilly/software/wizd/', 'rob30acc.wav']
-------------------------------------
INFO(db) read open /home/rreilly/software/wizd/data.input/db
INFO(db) read beg= 0 end= 1
INFO(vocab.cc,60) Vocab::read lexN= 35685 wordN= 32858
INFO(arcgraph.cc,37) ArcGraph::read open /home/rreilly/software/wizd/english/sys/si/si.cvx
INFO(arcgraph.cc,48) ArcGraph::read stateN= 12718225 nullN= 857406 arcsN= 61412495
INFO(arcgraph.cc,111) ArcGraph::read connectivity check passed
INFO(escore.cc,85) EScorer::read /home/rreilly/software/wizd/english/sys/si/si.fs
INFO(main) decoding spk= Rob30acc
INFO(FeAudio::computeSpk) read /home/rreilly/software/wizd/data.input/rob30acc.wav
INFO(FeNorm::computeSpk) read data.output//si/norm/Rob30acc.mat mode= 0
Rob30acc_0001 score= -73.90434 frameN= 50
Rob30acc_0001 words=
Rob30acc_0002 score= -41.60270 frameN= 42
Rob30acc_0002 words=
Rob30acc_0003 score= -1143.95764 frameN= 994
Rob30acc_0003 words= IF YOU WANT TO EXPAND YOUR NETWORK DON'T WHEN OR SOFTWARE PENDING CONFERENCES YOU ALSO WANT TO CONSIDER BEING THE PERSON WHO YOURSELF INCORRECT TYPES
Rob30acc_0004 score= -1067.03064 frameN= 1031
Rob30acc_0004 words= MAIN RATE THE IDEA SPEAKING TO A GROUP UP THERE THAT'LL WORK IN TERMS OF STRESS BUT MANY WRITERS FIND IT SPEAKING AND WRITING OR NATURAL COMPLIMENTS
Rob30acc_0005 score= -1075.77625 frameN= 1024
Rob30acc_0005 words= IF YOU CRAVE ATTENTION LIKE TEACHING PEOPLE NEW SKILLS ORANGE OR I'M JOINING MAKING PEOPLE LAUGHED SPEAKING CAN OFFER A LIQUOR SOPHOMORE
Rob30acc_0006 score= -1515.97522 frameN= 1443
Rob30acc_0006 words= SPEAK ENGLISH YOURS IS ABILITY CAN HELP STRENGTHEN YOUR POSITION AS AN EXPERT IN A PARTICULAR FIELD FEEL INSECURE ABOUT HER KNOWLEDGE REMEMBER IT BUT IF YOU'RE WRITING ABOUT A SUBJECT YOU'RE USUALLY PERCEIVED AS AN EXPORT
-------------------------------------
Host : CURLYAMD
Memory : 92264 [KB]
Date : Thu Aug 31 14:17:12 2006
The following is an excerpt from one of the output files, data.output/si/ctm/Rob30acc.ctm:
Rob30acc 0003 5.86 0.24 IF
Rob30acc 0003 6.1 0.11 YOU
Rob30acc 0003 6.21 0.24 WANT
Rob30acc 0003 6.45 0.11 TO
Rob30acc 0003 6.56 0.49 EXPAND
Rob30acc 0003 7.05 0.11 YOUR
Rob30acc 0003 7.16 0.55 NETWORK
Rob30acc 0003 7.71 0.25 DON'T
Rob30acc 0003 7.96 0.2 WHEN
Rob30acc 0003 8.16 0.14 OR
Rob30acc 0003 8.3 0.54 SOFTWARE
Rob30acc 0003 8.84 0.36 PENDING
Rob30acc 0003 9.2 1.55 CONFERENCES
Rob30acc 0003 10.78 0.24 YOU
Rob30acc 0003 11.02 0.33 ALSO
Rob30acc 0003 11.35 0.25 WANT
Rob30acc 0003 11.6 0.1 TO
Rob30acc 0003 11.7 0.45 CONSIDER
Rob30acc 0003 12.15 0.32 BEING
Rob30acc 0003 12.47 0.11 THE
Rob30acc 0003 12.58 0.37 PERSON
Rob30acc 0003 12.95 0.3 WHO
Rob30acc 0003 13.33 0.66 YOURSELF
Rob30acc 0003 14.44 0.64 INCORRECT
Rob30acc 0003 15.08 0.49 TYPES
Rob30acc 0004 16.1 0.24 MAIN
Rob30acc 0004 16.34 0.3 RATE
Rob30acc 0004 16.64 0.17 THE
Rob30acc 0004 16.81 0.49 IDEA
Rob30acc 0004 17.3 0.64 SPEAKING
Rob30acc 0004 17.94 0.17 TO
Rob30acc 0004 18.11 0.09 A
Rob30acc 0004 18.2 0.38 GROUP
The program creates several files (part of one is shown above) that log the text recognized, along with other data related to the conversion. Scripts could be used to further process the data for insertion or analysis with other software or database applications.
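As an illustration of that kind of post-processing, here is a minimal Python sketch that reads records like those in the .ctm excerpt and rebuilds the text of each utterance. The field meanings (speaker, utterance number, start time in seconds, duration in seconds, word) are my reading of the excerpt rather than anything from Wizzard's documentation, and the names CtmWord, parse_ctm, and text_by_utterance are made up for this example.

```python
from typing import NamedTuple


class CtmWord(NamedTuple):
    speaker: str
    utterance: str
    start: float      # assumed: start time in seconds
    duration: float   # assumed: word duration in seconds
    word: str


def parse_ctm(lines):
    """Turn whitespace-separated five-field lines into CtmWord records."""
    words = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        speaker, utterance, start, duration, word = fields
        words.append(CtmWord(speaker, utterance, float(start), float(duration), word))
    return words


def text_by_utterance(words):
    """Group recognized words by utterance number, in file order."""
    grouped = {}
    for w in words:
        grouped.setdefault(w.utterance, []).append(w.word)
    return {utt: " ".join(ws) for utt, ws in grouped.items()}


# A few records copied from the excerpt above.
sample = """\
Rob30acc 0003 5.86 0.24 IF
Rob30acc 0003 6.1 0.11 YOU
Rob30acc 0003 6.21 0.24 WANT
Rob30acc 0004 16.1 0.24 MAIN
Rob30acc 0004 16.34 0.3 RATE
"""
print(text_by_utterance(parse_ctm(sample.splitlines())))
# prints {'0003': 'IF YOU WANT', '0004': 'MAIN RATE'}
```

From here the records could be written to a database table or matched against timestamps in the original audio.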
I also recorded some text spoken by my daughter. She spoke into a handheld digital voice recorder that I use for interviews. The audio was pulled into Audacity over an audio cable between the recorder and my HP/Linux notebook. After a little editing for dead spots, I saved the file in .wav format.
Recognition wasn't very good for my daughter's brief recitation of a page from a story book. I'd be surprised if it topped 5%. On balance, we should keep a few things in mind.
- She has a soft, higher-pitched voice.
- There were frequent stops and starts in the middle of sentences.
- Kids don't always read smoothly.
- And, kids definitely don't enunciate consistently.
I had pretty good results with my own voice when I paid careful attention to speaking clearly and deliberately. I'm sure there are ways to optimise the software for a higher kid-speech recognition level.