Wednesday, 4 November 2015

audio - Speech comparison algorithm for rating similarity


I am trying to compare two speech samples and rate their similarity. Think of someone trying to repeat a phrase, and then comparing those two audio files.


I started by implementing the MFCC (http://en.wikipedia.org/wiki/Mel-frequency_cepstrum) algorithm. I calculate the MFCCs of both audio samples, which gives me roughly 500 frames of audio (10 ms each, each overlapping the previous by about 30%), with 14 or so MFCC coefficients per frame. So a 500x14 matrix for each audio signal.
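For illustration, here is a minimal sketch of that extraction step, assuming the librosa library (any MFCC implementation that yields a frames-by-coefficients matrix would work the same way; file names and parameters are placeholders):

```python
import librosa

def extract_mfcc(path, n_mfcc=14, frame_ms=10.0, overlap=0.3):
    y, sr = librosa.load(path, sr=None)      # keep the file's native sample rate
    frame_len = int(sr * frame_ms / 1000)    # ~10 ms analysis window
    hop = int(frame_len * (1.0 - overlap))   # ~30% overlap between frames
    # librosa returns (n_mfcc, n_frames); transpose to frames x coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop, n_mels=40)
    return mfcc.T                            # roughly 500 x 14 for a few seconds of audio

A = extract_mfcc("reference.wav")   # hypothetical file names
B = extract_mfcc("attempt.wav")
```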


Then I take the naive approach of simply differencing the matrices. This does not give very promising results. Half of the time, when I compare completely different audio samples (where different phrases are spoken), I get a smaller difference than when comparing audio in which I try to repeat the same phrase! This is clearly backwards and can't give me a good scoring algorithm.
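Concretely, the naive comparison amounts to something like this sketch (assuming A and B are the frames x coefficients matrices above, truncated to the same number of frames):

```python
import numpy as np

def naive_distance(A, B):
    n = min(len(A), len(B))
    # Mean Euclidean distance between frame i of A and frame i of B.
    # Without any alignment, frame i of one recording may fall on a different
    # phoneme than frame i of the other, so the score is unreliable.
    return np.mean(np.linalg.norm(A[:n] - B[:n], axis=1))
```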


How can I improve on this? I thought MFCCs were a really important part of speech processing, though clearly I need to do more with them.




Answer



First, you will have to correct for differences in timing. For example, if one utterance is "--heeelloooo---" and the other "hellooooooo----" (- representing silence), a direct pairwise comparison of MFCC frames will show differences simply because the two samples are not aligned. You can use Dynamic Time Warping to find the best alignment between the two sequences of feature vectors, and compute the corresponding distance.
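A minimal sketch of this alignment step, using librosa's DTW routine for illustration (any DTW implementation that returns a warping path will do; A and B are the frames x coefficients MFCC matrices):

```python
import numpy as np
import librosa

def align(A, B):
    # A, B: (frames x coefficients) MFCC matrices.
    # librosa expects features in columns, hence the transposes.
    _, wp = librosa.sequence.dtw(X=A.T, Y=B.T, metric='euclidean')
    wp = wp[::-1]                  # the path is returned end-to-start
    # For every frame j of B, keep the frame of A the path matches to it,
    # producing A' with the same number of frames as B.
    A_warped = np.zeros_like(B)
    for i, j in wp:
        A_warped[j] = A[i]         # later matches overwrite earlier ones
    return A_warped
```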


The second problem is that if the two recordings are not from the same speaker, you will have to compensate for the differences in timbre. The MFCCs of a female speaker saying "aaa" are not the same as the MFCCs of a male speaker saying the same phoneme! A relatively simple model to account for variations in voice timbre is to assume that there exists a linear transform $\Gamma$ which "maps" the MFCCs of one speaker onto the MFCCs of another speaker (to be fair, only a small subset of these transforms accurately models how changing parameters such as age, gender, etc. "shifts" the MFCCs). Once the two recordings have been aligned, even roughly, you can use a least-squares procedure to estimate $\Gamma$. This procedure is known as speaker normalization or speaker adaptation.
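Once aligned, estimating $\Gamma$ is an ordinary least-squares fit. A sketch, with frames stored as rows so the transform is applied on the right of the matrix (mathematically the same as $\Gamma A'$ written with column vectors):

```python
import numpy as np

def estimate_gamma(A_warped, B):
    # Solve A_warped @ Gamma ~ B for Gamma (coefficients x coefficients)
    # in the least-squares sense.
    Gamma, *_ = np.linalg.lstsq(A_warped, B, rcond=None)
    return Gamma
```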


Your comparison procedure will thus consist of the following steps. $A$ and $B$ are your original MFCC sequences.



  • Align the two utterances using DTW. This yields $A'$, a matrix with the observations from $A$ warped/shifted to be aligned with the observations in $B$. You can stop here if $A'$ and $B$ are known to be from the same speaker.

  • Estimate the transform $\Gamma$ which minimizes the difference between $\Gamma A'$ and $B$.

  • Use the distance between $\Gamma A'$ and $B$ as your metric.


A last thing that comes to mind: you should discard the first MFCC coefficient (which roughly expresses the signal loudness) to improve your system's ability to match utterances pronounced at a different volume/recording level.
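Putting the steps together, a rough end-to-end sketch using the hypothetical helpers from the earlier snippets (extract_mfcc, align, estimate_gamma):

```python
import numpy as np

def similarity_score(A, B, same_speaker=False):
    A, B = A[:, 1:], B[:, 1:]          # discard the first (loudness-related) coefficient
    A_warped = align(A, B)             # DTW alignment: A' matched frame-by-frame to B
    if not same_speaker:
        Gamma = estimate_gamma(A_warped, B)
        A_warped = A_warped @ Gamma    # speaker normalization
    # Mean per-frame Euclidean distance: smaller means more similar.
    return np.mean(np.linalg.norm(A_warped - B, axis=1))
```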


