I know the basics of DSP and am now trying to complete a project on speech recognition. I have read as many resources as I could find, and the general idea I got is:
- take a sample sound and break it into small frames
- calculate the MFCCs for each frame (13 of them)
- use DTW to match the test pattern against the template patterns
Now, here are the things I am having trouble with:
- 13 MFCCs for n frames leave me with 13*n MFCCs. How exactly do I use DTW here?
- DTW gives me a distance between two patterns. How do I determine whether they match?
Answer
Let me warn you at the very beginning that the DTW approach is suitable only for spoken-word recognition. Nonetheless, it is interesting as a basic exercise.
I assume that you have a database of training files (templates) and that you have already extracted MFCCs for them. Each training file contains one utterance of a given word class, e.g. "Hello" or "Bye". I also assume that each class has more than one example, which is plausible.
Here is what you should do during the recognition stage:
- Once you have the recording of the whole word, calculate the MFCCs for each frame. You want the recording of the whole word, not just part of it (a short extraction sketch follows this list).
- Now you calculate the distance between your recording and each of the templates in your database. In the case of DTW, you compute the local cost between pairs of 13-dimensional frames with a simple distance metric/norm, e.g. Euclidean or Manhattan. Once the DTW algorithm has finished, you end up with a single distance value (the final entry of the accumulated cost matrix) between your test sample and each of the templates (see the DTW sketch after this list).
- The last step is to make a decision: to which class does your test sample actually belong? You can do this by picking the class of the template with the minimum DTW distance, or, even better, by using k-Nearest Neighbours (see the sketch after this list).
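A minimal sketch of the feature-extraction step. I am assuming librosa here (my choice of library, not something the answer prescribes), and "hello.wav" is just a made-up file name:

```python
import librosa

# Assumption: the utterance is stored as a WAV file; "hello.wav" is hypothetical.
y, sr = librosa.load("hello.wav", sr=None)

# 13 MFCCs per frame; librosa returns an array of shape (13, n_frames),
# so transpose it to get one 13-dimensional vector per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
```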
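A minimal DTW sketch for the distance step, assuming each utterance has already been reduced to a NumPy array of shape (n_frames, 13) as above; the function name dtw_distance is mine, not from any particular library:

```python
import numpy as np

def dtw_distance(test, template):
    """Accumulated DTW cost between two MFCC sequences.

    test, template: arrays of shape (n_frames, 13); frame counts may differ.
    Returns the total alignment cost, i.e. the final entry of the
    accumulated cost matrix.
    """
    n, m = len(test), len(template)

    # Local cost: Euclidean distance between every pair of 13-dim frames.
    local = np.linalg.norm(test[:, None, :] - template[None, :, :], axis=2)

    # Accumulated cost matrix with the usual (insertion, deletion, match) steps.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                  acc[i, j - 1],      # deletion
                                                  acc[i - 1, j - 1])  # match
    return acc[n, m]
```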
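And a sketch of the decision step, assuming a list of (label, MFCC array) template pairs and the dtw_distance helper from the previous sketch; with k=1 it reduces to the minimum-distance rule:

```python
from collections import Counter

def classify(test_mfcc, templates, k=3):
    """templates: list of (label, mfcc_array) pairs.
    Returns the majority label among the k templates closest in DTW distance."""
    scored = sorted((dtw_distance(test_mfcc, tmpl), label)
                    for label, tmpl in templates)
    nearest_labels = [label for _, label in scored[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```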