Wednesday, 4 November 2015

filters - Simplest way of detecting where audio envelopes start and stop


Below is a signal which represents a recording of someone talking. I would like to create a series of smaller audio signals based on this. The idea being to detect when 'important' sound starts and ends and use those for markers to make new snippet of audio. In other words, I would like to use the silence as indicators as to when an audio 'chunk' has started or stopped and make new audio buffers based on this.


So for example, if a person records himself saying


Hi [some silence] My name is Bob [some silence] How are you?


then I would like to make three audio clips from this. One that says Hi, one that says My name is Bob and one that says How are you?.


My initial idea is to run through the audio buffer constantly checking where there are areas of low amplitude. Maybe I could do this by taking the first ten samples, average the values and if the result is low then label it as silent. I would proceed down the buffer by checking the next ten samples. Incrementing along in this way I could detect where envelopes start and stop.


If anyone has any advice on a good, but simple way to do this that would be great. For my purposes the solution can be quite rudimentary.


I'm not a pro at DSP, but understand some basic concepts. Also, I would be doing this programmatically so it would be best to talk about algorithms and digital samples.


Thanks for all the help!


enter image description here




EDIT 1


Great responses so far! Just wanted to clarify that this is not on live audio and I will be writing the algorithms myself in C or Objective-C so any solutions that use libraries aren't really an option.



Answer




This is the classic problem of speech detection. First thing to do would be to Google the concept. It is widely used in digital communication and there's been a lot of research conducted on the subject and there are good papers out there.


Generally, the more background noise you have to deal with the more elaborate your method of speech detection must be. If you're using recordings taken in a quiet room, you can do it very easily (more later). If you have all sorts of noise while someone is talking (trucks passing by, dogs barking, plates smashing, aliens attacking), you'll have to use something much more clever.


Looking at the waveform you attached, your noise is minimal, so I suggest the following:



  1. Extract signal envelope

  2. Pick a good threshold

  3. Detect places where envelope magnitude exceeds threshold


What does this all mean? An envelope of a signal is a curve that describes its magnitude over time, independently of how its frequency content makes it oscillate (see image below).


enter image description here



Envelope extraction can be done by creating a new signal that contains absolute values of you original signal, e.g. $\{ 1, 45, -6, 2, -43, 2 \ldots \}$ becomes $\{ 1, 45, 6, 2, 43, 2 \ldots \}$, and then low-pass filtering the result. The simplest low-pass filter can be implemented by replacing each sample value by an average of its N neighbors on both sides. The best value of N can be found experimentally and can depend on several things such as your sampling rate.


You can see from the image that is you don't have much noise present, your signal envelope will always be above a certain threshold (loudness level), and you can consider those regions as speech detected regions.


No comments:

Post a Comment

readings - Appending 内 to a company name is read ない or うち?

For example, if I say マイクロソフト内のパートナーシップは強いです, is the 内 here read as うち or ない? Answer 「内」 in the form: 「Proper Noun + 内」 is always read 「ない...