 
 

Conversational Voice Recognition With Wizzard Software - page 2

Hearing Voices

  • September 27, 2006
  • By Rob Reilly

The software does work on a consumer machine, but it is really designed to process offline audio files in a stout Linux server environment.

Wizzard provided me with a review copy to try on my Athlon 64-powered HP Pavilion notebook. The machine runs 64-bit SUSE Linux version 10.0. It has 1 GB of memory, an 80 GB disk, and speeds along at 2.0 GHz. With this setup I was able to process a few of my own audio files. It typically took about a minute to run through a 100-word paragraph.

Installation was straightforward and consisted of copying the file from the CD onto my hard drive. A quick untarring of the wizd_wizzscribe_si.tar file and I was all set to configure.

Next, under the data.input directory, I created a db file with the vi editor. I also modified the shared.py file in the english/cfg/ directory. The db file holds the utterance names, audio file name, and audio starting/stopping points. The shared.py file tells the program where to find audio files and where to put the results.

Initially, every time I ran the program I got a "root directory not found" error. The problem was eventually solved with the "-r" command line option.

Here's the command line that worked:

rreilly> ../bin/transcribe -m fast -r /home/rreilly/software/wizd/ rob30acc.wav

Note that the program was installed and executed by a normal user. Much of the application is written in Python.
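
For batch work, the same command line could be assembled from a script. The following is only a minimal sketch: it uses just the two options shown above ("-m" and "-r"), and the helper name build_transcribe_command is hypothetical, not part of the Wizzard package. It builds the argument list without running anything.

```python
import shlex


def build_transcribe_command(root_dir, audio_file, mode="fast",
                             program="../bin/transcribe"):
    """Return the argument list for one transcribe run.

    Only the -m and -r options that appear in the article are used;
    no other flags are assumed to exist.
    """
    return [program, "-m", mode, "-r", root_dir, audio_file]


cmd = build_transcribe_command("/home/rreilly/software/wizd/", "rob30acc.wav")
# Print the command as it would be typed at a shell prompt.
print(shlex.join(cmd))
```

The list form can be handed to Python's subprocess.run() as-is, which avoids shell quoting problems with unusual file names.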

I used Audacity and an inexpensive desktop microphone to create several audio clips of spoken text, one of which was the rob30acc.wav file. At first I used the default 48 kHz sample rate and 32-bit sample format to capture my voice. Todd Kammerer, an engineer for Wizzard, suggested that I change the settings to 8 kHz for the sample rate and 16-bit for the sample format.

Of course, my early 48 kHz/32-bit audio files didn't work very well. There was no problem processing the files, but the recognition was poor.
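
Since the program processes a mismatched file without complaint, it can help to check a clip's format before transcribing it. Here is a minimal sketch using Python's standard wave module. The function names are made up for illustration, and the 8 kHz/16-bit target is simply the setting recommended above, not a documented requirement of the software.

```python
import os
import tempfile
import wave

RECOMMENDED_RATE_HZ = 8000   # sample rate suggested by Wizzard's engineer
RECOMMENDED_BITS = 16        # bits per sample suggested by Wizzard's engineer


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_recommended(path):
    """True if the file uses the recommended sample rate and bit depth."""
    rate_hz, bits, _channels = describe_wav(path)
    return rate_hz == RECOMMENDED_RATE_HZ and bits == RECOMMENDED_BITS


def write_silence(path, rate_hz, sample_width_bytes, seconds=1):
    """Write a mono PCM WAV of silence; used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (rate_hz * sample_width_bytes * seconds))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.wav")
        bad = os.path.join(tmp, "bad.wav")
        write_silence(good, 8000, 2)    # 8 kHz, 16-bit
        write_silence(bad, 48000, 4)    # 48 kHz, 32-bit
        print(describe_wav(good), matches_recommended(good))
        print(describe_wav(bad), matches_recommended(bad))
```

One caveat: the wave module reads integer PCM files. A clip exported in a 32-bit float format may be rejected with a wave.Error rather than described.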

Switching over to the recommended audio settings worked pretty well. With these settings I was able to get about 80% recognition on an 80-100 word paragraph. On the few files that I tried, I didn't notice any real difference in recognition between the "fast" and "accurate" mode options. "Fast" processed speech only slightly faster, on the order of a couple of seconds. Many factors come into play in speech recognition, so my experience certainly shouldn't be considered an exhaustive test of the program.
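
A recognition figure like the one above can be put on firmer footing by comparing the recognized text with the script that was read. The sketch below counts word-level substitutions, insertions, and deletions with a standard edit-distance calculation. It is not necessarily how the 80% estimate was reached, and the two sentences are invented sample data, not text from the test recordings.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance_in_words, reference_word_count).

    Case is ignored because the recognizer prints upper-case words.
    """
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    # Dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1], len(ref)


def word_accuracy(reference, hypothesis):
    """1 - (errors / reference words); can go negative with many insertions."""
    errors, total = word_errors(reference, hypothesis)
    return 1.0 - errors / total if total else 0.0


# Invented example: one substituted word and one dropped word.
spoken = "the quick brown fox jumps over the lazy dog"
recognized = "THE QUICK BROWN FOX JUMPED OVER LAZY DOG"
errors, total = word_errors(spoken, recognized)
print(f"{errors} errors in {total} words, "
      f"accuracy {word_accuracy(spoken, recognized):.0%}")
```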

Here is a partial listing (some lines were removed for brevity) of the program output to the terminal screen as it processed speech.

-------------------------------------
ATTILA     : V1.1 P004
Host       : CURLYAMD
Date       : Thu Aug 31 14:15:50 2006
Process ID : 9725
Arguments  : ['../bin/transcribe', '-m', 'fast', '-r', '/home/rreilly/software/wizd/', 'rob30acc.wav']
-------------------------------------

INFO(db)      read      open /home/rreilly/software/wizd/data.input/db
INFO(db)      read      beg= 0 end= 1
INFO(vocab.cc,60)      Vocab::read      lexN= 35685 wordN= 32858
INFO(arcgraph.cc,37)      ArcGraph::read      open /home/rreilly/software/wizd/english/sys/si/si.cvx
INFO(arcgraph.cc,48)      ArcGraph::read      stateN= 12718225 nullN= 857406 arcsN= 61412495
INFO(arcgraph.cc,111)      ArcGraph::read      connectivity check passed
INFO(escore.cc,85)      EScorer::read      /home/rreilly/software/wizd/english/sys/si/si.fs
INFO(main)      decoding      spk=   Rob30acc
INFO(FeAudio::computeSpk) read      /home/rreilly/software/wizd/data.input/rob30acc.wav
INFO(FeNorm::computeSpk) read      data.output//si/norm/Rob30acc.mat mode= 0
Rob30acc_0001 score= -73.90434 frameN= 50
Rob30acc_0001 words=
Rob30acc_0002 score= -41.60270 frameN= 42
Rob30acc_0002 words=
Rob30acc_0003 score= -1143.95764 frameN= 994
Rob30acc_0003 words= IF YOU WANT TO EXPAND YOUR NETWORK DON'T WHEN OR SOFTWARE PENDING CONFERENCES YOU ALSO WANT TO CONSIDER BEING THE PERSON WHO YOURSELF INCORRECT TYPES
Rob30acc_0004 score= -1067.03064 frameN= 1031
Rob30acc_0004 words= MAIN RATE THE IDEA SPEAKING TO A GROUP UP THERE THAT'LL WORK IN TERMS OF STRESS BUT MANY WRITERS FIND IT SPEAKING AND WRITING OR NATURAL COMPLIMENTS
Rob30acc_0005 score= -1075.77625 frameN= 1024
Rob30acc_0005 words= IF YOU CRAVE ATTENTION LIKE TEACHING PEOPLE NEW SKILLS ORANGE OR I'M JOINING MAKING PEOPLE LAUGHED SPEAKING CAN OFFER A LIQUOR SOPHOMORE
Rob30acc_0006 score= -1515.97522 frameN= 1443
Rob30acc_0006 words= SPEAK ENGLISH YOURS IS ABILITY CAN HELP STRENGTHEN YOUR POSITION AS AN EXPERT IN A PARTICULAR FIELD FEEL INSECURE ABOUT HER KNOWLEDGE REMEMBER IT 
                     BUT IF YOU'RE WRITING ABOUT A SUBJECT YOU'RE USUALLY PERCEIVED AS AN EXPORT

-------------------------------------
Host       : CURLYAMD
Memory     : 92264 [KB]
Date       : Thu Aug 31 14:17:12 2006
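
If the terminal output is captured to a file, the recognized text can be pulled out of the "words=" lines with a few lines of Python. This is a sketch based only on the listing above. It assumes each utterance name is followed by "words=" on one line, and it treats an indented line directly after a "words=" line as a wrapped continuation. The sample lines are copied from the listing.

```python
import re

# "<utterance name> words= <recognized text>", as seen in the listing.
WORDS_LINE = re.compile(r"^(\S+)\s+words=\s*(.*)$")


def extract_transcripts(log_text):
    """Map utterance name -> recognized text from 'words=' lines."""
    transcripts = {}
    current = None
    for line in log_text.splitlines():
        match = WORDS_LINE.match(line)
        if match:
            current = match.group(1)
            transcripts[current] = match.group(2).strip()
        elif current and line[:1].isspace() and line.strip():
            # Indented continuation of a wrapped words= line.
            joined = transcripts[current] + " " + line.strip()
            transcripts[current] = joined.strip()
        else:
            current = None
    return transcripts


SAMPLE_LOG = """\
Rob30acc_0002 score= -41.60270 frameN= 42
Rob30acc_0002 words=
Rob30acc_0005 score= -1075.77625 frameN= 1024
Rob30acc_0005 words= IF YOU CRAVE ATTENTION LIKE TEACHING PEOPLE NEW SKILLS ORANGE OR I'M JOINING MAKING PEOPLE LAUGHED SPEAKING CAN OFFER A LIQUOR SOPHOMORE
"""

for utterance, text in extract_transcripts(SAMPLE_LOG).items():
    print(utterance, "->", text or "(no words recognized)")
```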

The following is an excerpt from one of the output files, data.output/si/ctm/Rob30acc.ctm:

Rob30acc 0003 5.86  0.24  IF
Rob30acc 0003 6.1   0.11  YOU
Rob30acc 0003 6.21  0.24  WANT
Rob30acc 0003 6.45  0.11  TO
Rob30acc 0003 6.56  0.49  EXPAND
Rob30acc 0003 7.05  0.11  YOUR
Rob30acc 0003 7.16  0.55  NETWORK
Rob30acc 0003 7.71  0.25  DON'T
Rob30acc 0003 7.96  0.2   WHEN
Rob30acc 0003 8.16  0.14  OR
Rob30acc 0003 8.3   0.54  SOFTWARE
Rob30acc 0003 8.84  0.36  PENDING
Rob30acc 0003 9.2   1.55  CONFERENCES
Rob30acc 0003 10.78 0.24  YOU
Rob30acc 0003 11.02 0.33  ALSO
Rob30acc 0003 11.35 0.25  WANT
Rob30acc 0003 11.6  0.1   TO
Rob30acc 0003 11.7  0.45  CONSIDER
Rob30acc 0003 12.15 0.32  BEING
Rob30acc 0003 12.47 0.11  THE
Rob30acc 0003 12.58 0.37  PERSON
Rob30acc 0003 12.95 0.3   WHO
Rob30acc 0003 13.33 0.66  YOURSELF
Rob30acc 0003 14.44 0.64  INCORRECT
Rob30acc 0003 15.08 0.49  TYPES
Rob30acc 0004 16.1  0.24  MAIN
Rob30acc 0004 16.34 0.3   RATE
Rob30acc 0004 16.64 0.17  THE
Rob30acc 0004 16.81 0.49  IDEA
Rob30acc 0004 17.3  0.64  SPEAKING
Rob30acc 0004 17.94 0.17  TO
Rob30acc 0004 18.11 0.09  A
Rob30acc 0004 18.2  0.38  GROUP

The program creates several files (part of one is shown above) that log the text recognized, along with other data related to the conversion. Scripts could be used to further process the data for insertion or analysis with other software or database applications.
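
As an illustration of that kind of post-processing, here is a minimal Python sketch that reads the .ctm excerpt and reports the time span and text of each utterance. The column meanings are inferred from the excerpt, not taken from Wizzard documentation: the third column is read as a start time in seconds and the fourth as a duration, since 5.86 + 0.24 gives the next word's start of 6.1. The sample lines are copied from the excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CtmWord:
    speaker: str
    utterance: str
    start: float     # seconds from the start of the audio (inferred)
    duration: float  # seconds (inferred)
    word: str

    @property
    def end(self):
        return self.start + self.duration


def parse_ctm(text):
    """Parse whitespace-separated five-column lines into CtmWord records."""
    words = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        speaker, utterance, start, duration, word = fields
        words.append(CtmWord(speaker, utterance,
                             float(start), float(duration), word))
    return words


def utterance_spans(words):
    """Return {utterance: (start, end, text)} in order of first appearance."""
    spans = {}
    for w in words:
        if w.utterance in spans:
            start, end, text = spans[w.utterance]
            spans[w.utterance] = (start, max(end, w.end), text + " " + w.word)
        else:
            spans[w.utterance] = (w.start, w.end, w.word)
    return spans


SAMPLE_CTM = """\
Rob30acc 0003 5.86  0.24  IF
Rob30acc 0003 6.1   0.11  YOU
Rob30acc 0003 6.21  0.24  WANT
Rob30acc 0003 6.45  0.11  TO
Rob30acc 0004 16.1  0.24  MAIN
Rob30acc 0004 16.34 0.3   RATE
"""

for utt, (start, end, text) in utterance_spans(parse_ctm(SAMPLE_CTM)).items():
    print(f"{utt}: {start:.2f}-{end:.2f}s  {text}")
```

From records like these it is a short step to writing rows into a database table or lining the words up against the original audio.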

I also recorded some text spoken by my daughter. She spoke into a handheld digital voice recorder that I use for interviews. The audio was pulled into Audacity over an audio cable between the recorder and my HP/Linux notebook. After a little editing for dead spots, I saved the file in .wav format.

Recognition wasn't very good for my daughter's brief recitation of a page from a story book. I'd be surprised if it topped 5%. On balance, we should keep a few things in mind.

  • She has a soft, higher-pitched voice.
  • There were frequent stops and starts while she read a sentence.
  • Kids don't always read smoothly.
  • And kids definitely don't enunciate consistently.

I had pretty good results when I paid careful attention to speaking clearly and deliberately. I'm sure there are ways to optimise the software for a higher kid-speech recognition level.
