Tuesday, 26 April 2016

algorithms - Using MFCC in spoken word recognition


We're trying to implement a "simple" speech recognition application in MATLAB (isolated words from a very limited dictionary). We've been trying the following methods:



  1. Extract MFCC coefficients for each frame of the word, then compare them, using DTW, to templates we recorded previously, and take the template with the minimal distance as the recognized word (a sketch of this template-matching approach appears after this list). (Note: the "templates" are 10 recordings of each of the 6 dictionary words.)

  2. Extract MFCC coefficients for each frame, and run an SVM on the coefficients (one SVM classifier per dictionary word).

  3. Extract MFCC coefficients for each frame, define the feature vector as the vector of DTW distances to all the templates, and run an SVM on these features.
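A minimal sketch of method 1, assuming the Audio Toolbox's mfcc() and the Signal Processing Toolbox's dtw() are available; the variables templates (a cell array of MFCC matrices, one per recording) and labels (the corresponding dictionary words) are hypothetical:

```matlab
% Minimal sketch of method 1 (MFCC + DTW template matching).
% Assumes Audio Toolbox's mfcc() and Signal Processing Toolbox's dtw().
% `templates` (cell array of MFCC matrices) and `labels` are hypothetical.
[x, fs] = audioread('input.wav');   % hypothetical recording of one word
C = mfcc(x, fs);                    % rows = frames, columns = coefficients

bestDist = inf;
bestWord = '';
for k = 1:numel(templates)
    % dtw() aligns multichannel signals column-wise, so transpose the
    % frames-by-coefficients matrices before comparing.
    d = dtw(C', templates{k}');
    if d < bestDist
        bestDist = d;
        bestWord = labels{k};
    end
end
fprintf('Recognized word: %s\n', bestWord);
```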


None of the three methods worked well (almost everything was recognized incorrectly), and we don't know what the problem is. A few questions arose to which we don't know the answers:




  1. Do we really need 10 samples of each word, or can we "combine" them into a single template? If so, how?

  2. Should we run the SVM directly on the MFCC vectors? On the MFCC vector-of-vectors (one per word)? On the DTW values? Or should we combine all the MFCC vectors into a single one?

  3. If we look for the minimal DTW match between the input and the templates, should we weight the templates differently depending on their length, or even account for the differences between the templates (the templates of the word "1" are more distant from one another than those of the word "2")?



We would appreciate any help, or references to good sources. We thought about using HMMs, but they seem much more difficult and not really feasible within our time frame.


Thanks.



Answer



What finally worked for me:





  1. Calculate the DTW distance between the MFCC sequences of each pair of training samples.




  2. Classify in pairs (i.e., check whether the word $w$ belongs to {class1, class2}; if the decision is class1, then $w \notin$ class2. Then check {class1, class3}, and so on).




  3. For each word $w$ and pair of classes {class1, class2}, generate a 2D feature vector $p$ as follows: $p_i$ = mean of the DTW distances between $w$ and all the training words of class $i$.





  4. I then generated the feature vectors for every word of the training set (and for every class pair), and used them to train $\binom{n}{2}$ SVM classifiers ($n$ = number of classes); see the training sketch after this list.




  5. For the classification, we need to make $n-1$ tests (each test rules out one of the $n-1$ losing classes); see the classification sketch after this list.
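A minimal sketch of steps 1-4, assuming the Statistics and Machine Learning Toolbox's fitcsvm() is available in addition to dtw(); the layout train{c}{j} (the MFCC matrix of the j-th training recording of class c) is a hypothetical convention, not the original code:

```matlab
% Minimal sketch of the pairwise training (steps 1-4).
% train{c}{j} is the MFCC matrix (frames x coefficients) of the j-th
% training recording of class c -- a hypothetical layout.
n = numel(train);                  % number of classes
svms = cell(n, n);                 % svms{a,b} separates class a from class b

for a = 1:n-1
    for b = a+1:n
        X = [];  y = [];           % grown in a loop for brevity
        for c = [a b]              % use the training words of both classes
            for j = 1:numel(train{c})
                % 2D feature: mean DTW distance to each class's samples
                p = [meanDTW(train{c}{j}, train{a}), ...
                     meanDTW(train{c}{j}, train{b})];
                X = [X; p];  y = [y; c];
            end
        end
        svms{a, b} = fitcsvm(X, y);   % one binary SVM per class pair
    end
end

function m = meanDTW(w, samples)
% Mean DTW distance between the word w and all samples of one class.
d = zeros(1, numel(samples));
for k = 1:numel(samples)
    d(k) = dtw(w', samples{k}');   % dtw() aligns matrices column-wise
end
m = mean(d);
end
```

And a matching sketch of step 5, sequential elimination at test time; classifyWord is a hypothetical name, and meanDTW is the helper defined above:

```matlab
% Sequential elimination (step 5): keep a surviving candidate class and
% test it against each remaining class with the corresponding pair SVM.
function c = classifyWord(w, train, svms)
n = numel(train);
c = 1;                             % current surviving class
for b = 2:n
    lo = min(c, b);  hi = max(c, b);
    p = [meanDTW(w, train{lo}), meanDTW(w, train{hi})];
    c = predict(svms{lo, hi}, p);  % the winner survives to the next test
end
end
```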





Notes:



  • More words in the dictionary $\Rightarrow$ the method tolerates less "difference" between the original words (training data) and the input; i.e., users will encounter more misclassifications.


  • With only a few words (2 in my case), it works surprisingly well and makes almost no mistakes (even when users try to mislead it).

  • Using fewer than 10 samples per word will probably work (I'd recommend at least 2, though). I believe a bigger dictionary calls for more training data (I have not tested this).

  • Noise filtering is a must: recording with versus without a fan nearby noticeably shifted the feature vectors in space and caused a few mistakes (a filtering sketch follows this list).
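A minimal filtering sketch, assuming a simple band-pass over the speech band is enough for stationary noise such as a fan; the cutoff frequencies and file name are hypothetical, and heavier noise may need something like spectral subtraction:

```matlab
% Keep roughly the speech band; butter()/filtfilt() are in the
% Signal Processing Toolbox. Cutoffs are hypothetical and the upper
% cutoff must stay below fs/2.
[x, fs] = audioread('recording.wav');
[b, a] = butter(4, [100 4000] / (fs/2), 'bandpass');
xClean = filtfilt(b, a, x);        % zero-phase, so no added delay
```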

