Friday 4 March 2016

fft - About voice filtering


I have a waveform containing



  1. voice

  2. some children screaming

  3. some stretches where there is no sound at all

  4. some stretches where cars drive by while someone is talking


I want to detect the start and end times of the loudest voice, or the regions in which the same speaker is talking, so that I can cut them out.



What I did was run an FFT on the waveform, and I noticed that voice means there is (within a small time window of a few ms) always a high amplitude at the fundamental frequency, followed by equidistant peaks at integer multiples of that fundamental.


But this also applied to some noises.
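The harmonic pattern described above can be checked on a single short frame. A minimal Python/NumPy sketch (the 200 Hz fundamental, the 20 ms frame, and the number of harmonics are all made-up values for illustration):

```python
import numpy as np

fs = 8000
t = np.arange(int(0.02 * fs)) / fs                          # one 20 ms frame
# Synthetic "voiced" frame: 200 Hz fundamental plus two harmonics
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))  # windowed FFT magnitude
freqs = np.fft.rfftfreq(len(frame), 1 / fs)
f0 = freqs[spec.argmax()]                                   # strongest peak = fundamental
```

With a 20 ms frame the frequency resolution is 50 Hz, so the detected `f0` lands exactly on the 200 Hz bin here; real speech frames need peak interpolation for better accuracy.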


The goal is to detect voice and cut out the parts containing voice (i.e. keep them) together with all background sounds. It is not about isolating the voice, but about finding where the voice is and cutting it out along with the cars, children, or whatever else was around the microphone while recording. The problem is that I get many false positives when music (with no voice) is playing or when some squealing sound can be heard.


What is a better approach to find the time ranges where voice is present in a WAV file with these characteristics?



Answer



I don't think filtering is the way to go. The spectra of the different signals (the desired voice, the screaming children and the cars) probably overlap, so there is no real way to remove all that noise without also damaging your signal.


Actually, the best approach depends on how loud the noise is compared to the signal. If the speaker is near the microphone, and the relative noise level is therefore low, this sounds like a VAD (voice activity detection) kind of problem.


If the noise level is quite low, an energy-based VAD may work: after windowing the input signal (say, into 20 ms windows), compute each window's energy. Then apply a threshold to decide whether each window is noise or voice, and simply discard the windows that aren't loud enough to meet the threshold. This is unarguably the simplest approach.
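As a sketch of that energy-based approach (in Python/NumPy rather than Matlab; the window length, the relative threshold, and the toy test signal are all assumptions):

```python
import numpy as np

def energy_vad(signal, fs, win_ms=20, threshold_ratio=0.1):
    """Return one boolean voice/noise decision per window, by energy."""
    win = int(fs * win_ms / 1000)               # samples per window
    n = len(signal) // win
    frames = signal[:n * win].reshape(n, win)
    energy = (frames ** 2).sum(axis=1)          # per-window energy
    return energy > threshold_ratio * energy.max()  # simple relative threshold

# Toy example: quiet noise with a loud 200 Hz burst in the middle
np.random.seed(0)
fs = 8000
t = np.arange(fs) / fs
x = 0.01 * np.random.randn(fs)
x[3000:5000] += np.sin(2 * np.pi * 200 * t[3000:5000])
mask = energy_vad(x, fs)   # True only for windows inside the burst
```

In practice the threshold would be tuned, or estimated from the noise floor, rather than taken as a fixed fraction of the maximum.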


A simple improvement to this kind of VAD, assuming you only have one speaker, is to reduce the bandwidth in which you apply the algorithm above. If the children's voices begin at, say, 500 Hz, and the cars contribute low frequencies up to, say, 80 Hz, you could make the voice/noise decision on a band-pass filtered copy of the signal (e.g. 100 Hz - 300 Hz, assuming the speaker's fundamental frequency is around 200 Hz), and then keep the corresponding windows of the original signal. I know this is not ideal, as several conditions have to line up for the system to be reliable.
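A sketch of that band-limited decision, again in Python/NumPy. To stay dependency-free it uses a crude frequency-domain band-pass (zeroing FFT bins) instead of a proper filter, and the band edges and test tones are made-up values:

```python
import numpy as np

def bandlimited_vad(signal, fs, low=100.0, high=300.0, win_ms=20, threshold_ratio=0.1):
    """Decide voice/noise on a band-limited copy, but keep the ORIGINAL samples."""
    # Crude band-pass: zero every FFT bin outside [low, high] Hz
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    spec[(freqs < low) | (freqs > high)] = 0
    filtered = np.fft.irfft(spec, len(signal))

    win = int(fs * win_ms / 1000)
    n = len(signal) // win
    energy = (filtered[:n * win].reshape(n, win) ** 2).sum(axis=1)
    keep = np.repeat(energy > threshold_ratio * energy.max(), win)
    return signal[:n * win] * keep   # original windows that passed the in-band test

# Toy example: a 1000 Hz "squeal" followed by a 200 Hz "voice" burst
fs = 8000
t = np.arange(fs) / fs
x = np.zeros(fs)
x[:2000] = np.sin(2 * np.pi * 1000 * t[:2000])
x[4000:6000] = np.sin(2 * np.pi * 200 * t[4000:6000])
out = bandlimited_vad(x, fs)
```

The squeal carries almost no energy inside 100-300 Hz, so its windows are zeroed, while the 200 Hz burst survives untouched; a real implementation would use a proper IIR/FIR band-pass rather than FFT-bin zeroing.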


Unfortunately, this won't work when the noise gets louder. Furthermore, some phonemes may be discarded because of their low energy, such as /s/ or /z/, which makes the resulting speech sound strange. There are more advanced VAD techniques, such as Sohn's "A Statistical Model-Based Voice Activity Detection". You can find some Matlab implementations on the net. I have used it, combined with spectral subtraction for noise reduction (which doesn't seem to be needed in your case), in a noisy environment (babble + white Gaussian noise), and the results were surprisingly good.
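A common, much simpler patch for those clipped low-energy phonemes (not part of the answer above, just a standard trick known as a "hangover") is to extend every voiced decision a few windows in each direction before cutting. A sketch, where the extension length is a made-up value:

```python
import numpy as np

def hangover(mask, extend=3):
    """Extend each voiced window by `extend` windows on both sides."""
    out = mask.copy()
    for i in np.flatnonzero(mask):          # indices of voiced windows
        out[max(0, i - extend): i + extend + 1] = True
    return out

mask = np.array([False, False, True, True, False, False, False, False])
smoothed = hangover(mask, extend=2)         # voiced run grows by 2 windows each way
```

This keeps the quiet tails of words (trailing /s/, /z/) at the cost of admitting a little extra background around each voiced segment.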





Matlab code for using Voicebox's Sohn VAD.


in_file='yourfile.wav';
[s,fs,wmode,fidx]=readwav(in_file,'p',-1,3500); % -1 reads the whole file; 3500 samples are discarded at the beginning
[y1,zo]=vadsohn(s,fs); % y1 is 1 where there is voice, 0 where there isn't
y=s(1:length(y1)).*y1; % apply the voice mask to the signal
y1(1)=0; % force a defined starting state for the edge detection below

% Print the start/end time (in seconds) of every voiced segment by
% looking for rising and falling edges in the binary mask y1.
for i=2:length(y1)
    if y1(i) && ~y1(i-1)
        StartTime=(i-2)/fs
    elseif ~y1(i) && y1(i-1)
        EndTime=(i-2)/fs
    end
end
