I'm writing a program to emulate a vocoder. It runs and produces output, but any speech that comes through is very difficult to understand. I'm going to explain exactly how my implementation works, and then maybe somebody with experience with vocoders or DSP can help me.
128 samples of the voice signal go through a Fast Fourier Transform every audio frame. I then take the absolute values of the complex outputs to get the spectral envelope (the formants) of the voice. I chose this rather than multiple band pass filters and envelope followers because it's faster just to do an FFT. I am 100% sure that this is all done correctly.
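For concreteness, the analysis step is roughly equivalent to this (a simplified Python/NumPy sketch rather than my actual code; the sample rate is just an assumed value):

```python
import numpy as np

FRAME = 128          # samples per analysis frame, as described above
SAMPLE_RATE = 44100  # assumed sample rate; my real value may differ

def analyze_frame(voice_frame):
    """Magnitude spectrum of one 128-sample voice frame (65 bins, DC..Nyquist)."""
    return np.abs(np.fft.rfft(voice_frame))
```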
My carrier wave is run through several band pass filters centered around frequencies corresponding to the FFT output frequencies. To construct the output, I multiply each filter's output by the corresponding amplitude from the voice spectrum obtained with the FFT, and sum all of these products.
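The synthesis step then does roughly this (again a simplified sketch, not my actual code; `bandpass` is my low-pass-into-high-pass chain, sketched a bit further down):

```python
def synthesize_frame(carrier_frame, envelope):
    """Weight band-passed copies of the carrier by the matching FFT
    amplitudes and sum the products."""
    out = np.zeros(len(carrier_frame))
    for k, amp in enumerate(envelope):
        if k == 0:
            continue                              # skip the DC bin
        centre = k * SAMPLE_RATE / FRAME          # frequency of FFT bin k
        out += amp * bandpass(carrier_frame, centre)  # my LP->HP chain, below
    return out
```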
There are a few things that I suspect may be causing issues. The algorithm I used to implement the band pass filters is probably not the best: I feed the output of a low pass filter into a high pass filter, using the algorithms from the "Algorithmic Implementation" sections of the low pass and high pass filter articles on Wikipedia (sketched after this paragraph). I have a strong feeling that these simple filters may not be attenuating unwanted frequencies very well, but I can't find a better way to make a band pass filter. I also suspect that my carrier wave might not be very good. As I understand it, you need a carrier rich in frequency content (so obviously not a sine wave). I've listened to a lot of vocoders, and the carrier usually sounds like it has some dissonant notes in it. I've tried additive synthesis and FM synthesis, but I just can't get a carrier wave that doesn't sound too "clean".
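For reference, my band pass is roughly the two first-order recurrences from those Wikipedia sections chained together (the half-octave band width here is just a value picked for illustration):

```python
def lowpass(x, cutoff, fs=SAMPLE_RATE):
    """First-order low pass, per Wikipedia's "Algorithmic Implementation"."""
    dt = 1.0 / fs
    rc = 1.0 / (2 * np.pi * cutoff)
    alpha = dt / (rc + dt)
    y = np.zeros(len(x))
    y[0] = alpha * x[0]
    for i in range(1, len(x)):
        y[i] = alpha * x[i] + (1 - alpha) * y[i - 1]
    return y

def highpass(x, cutoff, fs=SAMPLE_RATE):
    """First-order high pass, per Wikipedia's "Algorithmic Implementation"."""
    dt = 1.0 / fs
    rc = 1.0 / (2 * np.pi * cutoff)
    alpha = rc / (rc + dt)
    y = np.zeros(len(x))
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = alpha * (y[i - 1] + x[i] - x[i - 1])
    return y

def bandpass(x, centre, width_octaves=0.5):
    """My band pass: a low pass fed into a high pass around `centre`."""
    lo = centre / 2 ** (width_octaves / 2)
    hi = centre * 2 ** (width_octaves / 2)
    return highpass(lowpass(x, hi), lo)
```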
If somebody could give me a better band pass algorithm (if that's what I need) or a good formula for a carrier wave, or point out some other likely problem, I would be very grateful.
ALSO, "vocoder" is not an available tag. Maybe somebody with enough permissions could kindly add this tag? Also, there's a tag for signal-processing but not digital-signal-processing or DSP, I'm not sure if it would be appropriate to add one of these tags or not.
Answer
First, it seems to me that you are doing the analysis incorrectly. 128 linearly spaced channels is both too many (it won't capture a smoothed spectral envelope; vocoders usually have at most ~30 channels) and too coarse at the low end (linear spacing gives very poor resolution in the lowest frequencies and won't discriminate the different formants). You really need to replace that with a bank of constant-Q band-pass filters, 2 or 3 bands per octave, to get something coarse enough to capture the spectral envelope without picking up individual frequency peaks, yet still able to discriminate the frequencies of interest.
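A rough sketch of what I mean (Python/SciPy purely for illustration; the band edges, filter order and smoothing cutoff below are arbitrary placeholder choices):

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 44100  # sample rate (assumed)

def constant_q_bank(f_lo=80.0, f_hi=8000.0, bands_per_octave=2, order=2):
    """Constant-Q band-pass bank: geometrically spaced edges, widths
    proportional to centre frequency. One SOS filter per band."""
    n_bands = int(round(np.log2(f_hi / f_lo) * bands_per_octave))
    edges = f_lo * 2.0 ** (np.arange(n_bands + 1) / bands_per_octave)
    return [butter(order, [edges[i], edges[i + 1]], btype="bandpass",
                   fs=FS, output="sos")
            for i in range(n_bands)]

def band_envelopes(voice, bank, smooth_hz=50.0):
    """Per-band envelope: band-pass, rectify, then smooth with a low pass."""
    smoother = butter(1, smooth_hz, btype="lowpass", fs=FS, output="sos")
    return [sosfilt(smoother, np.abs(sosfilt(sos, voice))) for sos in bank]
```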
Your analysis and synthesis filters can be implemented with biquads. Running a dozen of those in parallel is not cost-prohibitive compared to an FFT.
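For example, a constant-Q band-pass biquad using the coefficients from the widely used RBJ "Audio EQ Cookbook" (the 1 kHz / Q = 4 values in the usage comment are placeholders):

```python
from scipy.signal import lfilter

def bpf_biquad(f0, q, fs=FS):
    """Band-pass biquad (RBJ Audio EQ Cookbook, constant 0 dB peak gain).
    Returns (b, a) for use with scipy.signal.lfilter."""
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([alpha, 0.0, -alpha])
    a = np.array([1.0 + alpha, -2.0 * np.cos(w0), 1.0 - alpha])
    return b / a[0], a / a[0]

# e.g. one band centred at 1 kHz with Q = 4:
# b, a = bpf_biquad(1000.0, 4.0)
# band = lfilter(b, a, carrier)
```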
As for the carrier wave, you can start with a sawtooth or narrow pulse; avoid waveforms like triangle, square, or sine, whose spectra have "gaps". Then stack a few of them to play a chord. You can thicken the carrier by adding slightly detuned versions of each component, or octaves. Note that in order to achieve the best-sounding results, band-limited synthesis (for example with minBLEPs) has to be used to synthesize the sawtooth or pulse waves.
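For instance (the root note, intervals and detune amount below are placeholder choices; the saw here is band-limited simply by summing harmonics up to Nyquist, which is crude but alias-free):

```python
def saw(freq, n_samples, fs=FS):
    """Sawtooth built additively from harmonics below Nyquist
    (alias-free by construction; minBLEP is the more efficient route)."""
    t = np.arange(n_samples) / fs
    out = np.zeros(n_samples)
    k = 1
    while k * freq < fs / 2:
        out += np.sin(2 * np.pi * k * freq * t) / k
        k += 1
    return out * (2.0 / np.pi)

def carrier_chord(root=110.0, n_samples=FS, detune_cents=7.0):
    """Stack slightly detuned saws on a root, fifth and octave (illustrative)."""
    ratio = 2.0 ** (detune_cents / 1200.0)
    notes = [root, root * 1.5, root * 2.0]
    out = np.zeros(n_samples)
    for f in notes:
        out += saw(f, n_samples) + saw(f * ratio, n_samples)
    return out / (2 * len(notes))
```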