Monday, September 17, 2012

Speech Recognition – Setting up sclite word alignment

Word alignment is used to measure the accuracy of a decoder. The Sphinx tutorial references sclite from the National Institute of Standards and Technology (NIST). This time I’m going to share some notes on how to set up and run sclite.

Running sclite

Sclite is a tool for scoring and evaluating the output of speech recognition systems by comparing the hypothesized text (HYP) output by the speech recognizer to the correct, or reference (REF), text. After REF is compared to HYP (a process called alignment), statistics are gathered during scoring, and a variety of reports can be produced to summarize the performance of the recognition system.
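To make the idea of alignment more concrete, here is a minimal Python sketch of a word-level alignment that counts substitutions, deletions and insertions and derives a word error rate from them. It is not sclite's implementation: the function names are made up for illustration, and the unit cost for every error type is an assumption (sclite has its own cost weights and much richer reports).

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-cost
    word alignment of ref against hyp (unit cost per error, an assumption)."""
    r, h = ref.split(), hyp.split()
    # cost[i][j] = (total_errors, subs, dels, ins) for r[:i] versus h[:j]
    cost = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(h) + 1):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d = cost[i - 1][j - 1]
            if r[i - 1] == h[j - 1]:
                diag = d                   # correct word, no extra cost
            else:
                diag = (d[0] + 1, d[1] + 1, d[2], d[3])
            u = cost[i - 1][j]
            deletion = (u[0] + 1, u[1], u[2] + 1, u[3])
            l = cost[i][j - 1]
            insertion = (l[0] + 1, l[1], l[2], l[3] + 1)
            # Tuples compare on total_errors first, so this keeps the cheapest path.
            cost[i][j] = min(diag, deletion, insertion)
    return cost[len(r)][len(h)][1:]


def word_error_rate(ref, hyp):
    """WER = (S + D + I) / number of reference words."""
    n = len(ref.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return sum(align_counts(ref, hyp)) / n


if __name__ == "__main__":
    print(align_counts("the cat sat on the mat", "the cat sit on mat"))
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the example, "sat" is aligned with "sit" (one substitution) and the second "the" is missing from the hypothesis (one deletion), so two of the six reference words are in error.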

This is an example run using the files located at src\sclite\testdata\tests.hyp and src\sclite\testdata\tests.ref:

  • The '-h' option is a required argument which specifies the input hypothesis file.
  • The '-r' option, a required argument, specifies the input reference file which the hypothesis file(s) are compared to.

sclite -h C:\Project\SpeechRecognition\CMUSphinx\3rdPartyLibs\sctk-2.4.3\src\sclite\testdata\tests.hyp -r C:\Project\SpeechRecognition\CMUSphinx\3rdPartyLibs\sctk-2.4.3\src\sclite\testdata\tests.ref

[Image: screenshot of the sclite output]

Setting up sctk-2.4.0-20091110-0958.tar.bz2 on Windows 7

I downloaded the Speech Recognition Scoring Toolkit (SCTK), which includes the SCLITE, ASCLITE, tranfilt, hubscr, SLATreport and utf_filt scoring tools.

I was able to compile it with gcc version 3.4.4, which comes with the MinGW setup package mingw-get-inst-20101030.exe.

It was also necessary to modify 'src/rfilter1/' and change the value of OPTIONS to be blank (as specified in the instructions).

The following compilation error is thrown when compiling using gcc version 4.6.2:

recording.h:122:29: error: 'Filter::Filter' cannot appear in a constant-expression
recording.h:122:36: error: template argument 2 is invalid
recording.h:122:36: error: template argument 4 is invalid
make[3]: *** [main.o] Error 1

Setting up sctk-2.4.2-20120810-0938.tar.bz2 on Windows 7

Something similar happened with this version: I was able to compile it with gcc version 3.4.4, but not with the newer compiler.

The following compilation error is thrown when using gcc version 4.6.2:

In file included from asctools.h:23:0,
                 from asctools.cpp:22:
timeval.h:33:8: error: redefinition of 'struct timeval'
c:\mingw\bin\../lib/gcc/mingw32/4.6.2/../../../../include/winsock2.h:109:8: error: previous definition of 'struct timeval'
make[2]: *** [asctools.o] Error 1
make[2]: Leaving directory `/c/Project/SpeechRecognition/CMUSphinx/3rdPartyLibs/
make[1]: *** [all] Error 2
make[1]: Leaving directory `/c/Project/SpeechRecognition/CMUSphinx/3rdPartyLibs/
make: *** [all] Error 2

The following error is thrown when compiling rfilter1:

gcc  -o rfilter1 rfilter1.c
C:\Users\MANGEL~1\AppData\Local\Temp\ccuj1MU6.o:rfilter1.c:(.text+0x760): undefined reference to `strncmpi'
C:\Users\MANGEL~1\AppData\Local\Temp\ccuj1MU6.o:rfilter1.c:(.text+0x7c4): undefined reference to `strncmpi'
C:\Users\MANGEL~1\AppData\Local\Temp\ccuj1MU6.o:rfilter1.c:(.text+0x827): undefined reference to `strncmpi'
C:\Users\MANGEL~1\AppData\Local\Temp\ccuj1MU6.o:rfilter1.c:(.text+0x935): undefined reference to `strncmpi'
collect2: ld returned 1 exit status
make[1]: *** [rfilter1] Error 1
make[1]: Leaving directory `/c/Project/SpeechRecognition/CMUSphinx/3rdPartyLibs/sctk-2.4.2/src/rfilter1'
make: *** [all] Error 2

It is also possible to compile this version with gcc version 4.6.2 after removing the asclite tests and rfilter1 from the makefile.

After finishing the setup you are able to run sclite as described initially.


Sunday, September 9, 2012

Speech Recognition with PocketSphinx

The following post puts together information from different sources about speech recognition and gives a brief overview of the CMUSphinx project and how to get started with PocketSphinx on Windows.

CMUSphinx is an open source speech recognition project developed at Carnegie Mellon University. It consists of various tools used to build speech applications:
  • CMUclmtk — language model tools
  • Sphinxtrain — acoustic model training tools

It also includes the following recognizers (decoders):
  • Pocketsphinx: A recognizer written in C and designed for speed and real-time use. It supports desktop applications and mobile devices, and it requires the Sphinxbase library.
  • Sphinx3: Speech recognizer intended for researchers.
  • Sphinx4: Speech recognizer written in Java.

Let’s look at how HMM-based speech recognition works: it first learns the characteristics (or parameters) of a set of sound units, and then uses what it has learned about the units to find the most probable sequence of sound units for a given speech signal. The process of learning about the sound units is called training. The process of using the knowledge acquired to deduce the most probable sequence of units in a given signal is called decoding, or simply recognition.
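The decoding step can be illustrated with the Viterbi algorithm, the standard way to find the most probable state sequence in an HMM. The sketch below is a toy example and not how PocketSphinx is implemented: the two sound units ("n", "ow"), the observation labels ("nasal", "vowel") and all probabilities are invented for illustration, whereas a real decoder works on acoustic feature vectors with trained models.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) for the most probable state sequence."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s] = predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, p = max(
                ((q, best[-1][q] * trans_p[q][s]) for q in states),
                key=lambda pair: pair[1],
            )
            scores[s] = p * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: parameters like these would come from training.
STATES = ("n", "ow")
START_P = {"n": 0.8, "ow": 0.2}
TRANS_P = {"n": {"n": 0.5, "ow": 0.5}, "ow": {"n": 0.3, "ow": 0.7}}
EMIT_P = {
    "n": {"nasal": 0.9, "vowel": 0.1},
    "ow": {"nasal": 0.2, "vowel": 0.8},
}

if __name__ == "__main__":
    path, prob = viterbi(["nasal", "vowel", "vowel"], STATES, START_P, TRANS_P, EMIT_P)
    print(path, prob)
```

For the observation sequence nasal, vowel, vowel, the most probable unit sequence under this toy model is n, ow, ow.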

Setting up Pocketsphinx on Windows

Environment: Windows 7 and Visual Studio 2012, sphinxbase-0.7, pocketsphinx-0.7

Name the two source folders exactly sphinxbase and pocketsphinx: the pocketsphinx project has external dependencies that are referenced through relative paths such as “..\..\..\sphinxbase\include\sphinxbase\ad.h”.

To test the installation, let's run pocketsphinx_continuous.exe. This tool runs speech recognition in two modes: continuous listening from a microphone and continuous file transcription. Running it requires:
  • Copying sphinxbase.dll to the build folder, for example C:\Project\SpeechRecognition\CMUSphinx\pocketsphinx\bin\Debug.
  • The parameter -hmm, the directory containing the acoustic model files.
  • The parameter -lm, the word trigram language model input file.
  • The parameter -dict, the main pronunciation dictionary (lexicon) input file.

This is the command, using the model files included in the project:

pocketsphinx_continuous.exe -hmm C:\Project\SpeechRecognition\CMUSphinx\pocketsphinx\model\hmm\en_US\hub4wsj_sc_8k -dict C:\Project\SpeechRecognition\CMUSphinx\pocketsphinx\model\lm\en_US\cmu07a.dic -lm C:\Project\SpeechRecognition\CMUSphinx\pocketsphinx\model\lm\en_US\wsj0vp.5000.DMP

This is the output after I said “no no no”:

[Image: screenshot of the pocketsphinx_continuous output]

I will take a closer look at the project to check the accuracy of the recognition in more detail.


Language model: A statistical language model assigns a probability P(w1, ..., wm) to a sequence of m words by means of a probability distribution. Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval. In speech recognition and in data compression, such a model tries to capture the properties of a language and to predict the next word in a speech sequence.
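As a rough illustration of how such a probability can be computed, the following Python sketch estimates a bigram model from a tiny made-up corpus and multiplies the conditional probabilities of each word given the previous one. This is a simplification chosen for brevity: the -lm file used above is a trigram model, real models apply smoothing to unseen word pairs and model the end of a sentence, and the function names and corpus here are hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Count how often each word follows each history word."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()   # <s> marks the sentence start
        histories.update(words[:-1])         # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return histories, bigrams


def sequence_probability(sentence, histories, bigrams):
    """P(w1, ..., wm) approximated as the product of P(w_i | w_{i-1})."""
    words = ["<s>"] + sentence.split()
    probability = 1.0
    for prev, word in zip(words, words[1:]):
        if histories[prev] == 0:
            return 0.0                       # unseen history, no smoothing here
        probability *= bigrams[(prev, word)] / histories[prev]
    return probability


if __name__ == "__main__":
    corpus = ["no no no", "no yes", "yes no"]   # made-up training data
    histories, bigrams = train_bigram(corpus)
    for text in ("no no", "yes no", "yes yes"):
        print(text, sequence_probability(text, histories, bigrams))
```

With this corpus, "no no" comes out as more probable than "yes no", and "yes yes" gets probability zero because that word pair never occurs in the training data, which is the gap that smoothing is meant to address.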

HMM (hidden Markov model): A statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states.