Sunday, 16 August 2015

audio - Pitch Detection: avoiding frequency doubling / halving


I am working on a tuning app. I have so far tried 3 different libraries with a number of different algorithms. However, when I compare to other tuner apps, I seem to be getting frequency doubling on certain notes with certain instruments. Sometimes frequency halving depending on other factors. I have tried adjusting the buffer size (size of array of samples) and also the sampling frequency.


Algorithms include: Auto-Correlation, YINFFT, Dynamic Wavelet.


Is there sometimes a need to filter the signal before hitting the algorithm?



Answer



okay, this is answer Part 2. doing it as a separate answer because, as soon as there are LaTeX equations put in, the simultaneous rendering and typing get very slow.


so my questions (1) and (2) were meant to lead you to make a couple of basic conclusions that can help in understanding the source of the Octave Problem and, with such understanding, maybe craft code that can avoid some (maybe not all) of the octave errors.



about (1), the issue is, we hear stuff with our ears and brain (and as such we hear a "pitch" of a tone that is most often associated with the fundamental frequency $f_0$) but the Pitch Detection Algorithm (PDA) is not hearing anything. it is doing math and making the logical decisions that it is programmed to make. so, mathematically, if a tone is judged to be periodic with a fundamental frequency of 440 Hz, it is just as well a tone of 220 Hz or 146.67 Hz or 110 Hz or 88 Hz or 55 Hz (each is 440 Hz divided by an integer). it is actually just as periodic with those fundamentals (and the periods associated with those fundamentals) as it is at 440 Hz.


so then, how do we choose which one? since these are all periods that are integer multiples of the same root period (1/440 second), we normally choose the smallest such period that satisfies the conditions of periodicity.
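a small sketch of that ambiguity, not part of any particular PDA: it compares a 440 Hz sine against delayed copies of itself. the sample rate of 44 000 Hz is an assumption chosen only so that 1/440 second is exactly 100 samples, and `mean_sq_diff` is a hypothetical helper name.

```python
import math

FS = 44_000   # assumed sample rate: makes 1/440 s exactly 100 samples
F0 = 440.0
N = 2000      # number of samples compared (20 full periods)

def mean_sq_diff(x, lag, n=N):
    """Mean of (x[i] - x[i + lag])**2 over the first n samples."""
    return sum((x[i] - x[i + lag]) ** 2 for i in range(n)) / n

# Pure 440 Hz sine, long enough for the largest lag tested below.
x = [math.sin(2 * math.pi * F0 * i / FS) for i in range(N + 1000)]

# 100 samples is the true period; 200, 300 and 800 are integer multiples of it.
for lag in (50, 100, 200, 300, 800):
    print(f"lag {lag:4d} ({FS / lag:7.2f} Hz): {mean_sq_diff(x, lag):.3e}")
```

the mismatch at half a period (lag 50) is large, while lags 100, 200, 300 and 800 all sit at rounding-noise level, so the math alone cannot tell 440 Hz from 220 Hz, 146.67 Hz or 55 Hz.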


so then, what's the problem? seems like we have a well-defined rule: "choose the smallest value of $T$ that satisfies $x(t) \approx x(t+T)$."


problem 1: satisfying $x(t) \approx x(t+T)$ is the same as satisfying $x(t) - x(t+T) \approx 0$. how do you determine that? because this difference can be either positive or negative, we need a measure that ignores its sign, so we try to pick $T$ to minimize something like


$$Q(T) \triangleq \int_{-\infty}^{+\infty} |x(t)-x(t+T)|^p w(t-t_0) \ dt$$


$w(t-t_0)$ is a window function centered at $t_0$ (the portion of the tone of current interest) and $p$ is some power. if $p=1$, then we have the Average Magnitude Difference Function (AMDF) and if $p=2$, we have the Average Squared Difference Function (ASDF), which i prefer for a number of nice mathematical reasons.
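a discrete-time sketch of $Q(T)$, under stated assumptions: the integral becomes a sum over samples, the window is a Hann window (the text does not say which window to use), the window starts at index `start` rather than being centered at $t_0$, and the sum is divided by the window's total weight so the result is an average. the names `hann`, `q` and `tone` are made up for this illustration.

```python
import math

FS = 44_000      # assumed sample rate: a 440 Hz period is exactly 100 samples
WIN_LEN = 1024   # assumed window length in samples

def hann(n):
    """Hann window of n samples (an assumed choice for w)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * (k + 0.5) / n) for k in range(n)]

def q(x, lag, start, win, p=2):
    """Windowed average of |x[i] - x[i + lag]|**p; p=1 is AMDF, p=2 is ASDF."""
    total = sum(w * abs(x[start + k] - x[start + k + lag]) ** p
                for k, w in enumerate(win))
    return total / sum(win)

def tone(i):
    """Test tone: 440 Hz fundamental plus two weaker harmonics."""
    th = 2 * math.pi * 440 * i / FS
    return math.sin(th) + 0.5 * math.sin(2 * th) + 0.25 * math.sin(3 * th)

x = [tone(i) for i in range(WIN_LEN + 400)]
win = hann(WIN_LEN)

for lag in (50, 100, 150, 200):
    amdf = q(x, lag, 0, win, p=1)
    asdf = q(x, lag, 0, win, p=2)
    print(f"lag {lag:3d}: AMDF {amdf:.3e}  ASDF {asdf:.3e}")
```

both measures dip to nearly zero at lag 100 and again at lag 200, and stay large at lags 50 and 150, whichever power $p$ is used.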


so when we say "$x(t) \approx x(t+T)$", we are also saying "$Q(T)$ is small". how small? and how small is necessary to say that $T$ is a potential period? that is a "thresholding problem" (the bane of many a DSP algorithm because of the ungraceful failures that occur when the threshold is not met). now, the way we sometimes get around threshold problems is to use a reasonably loose and inclusive threshold and get lots of pitch candidates (of which only one is the candidate you will ultimately choose). two criteria for picking one candidate over another are (a) which candidate has the shortest period (so we pick $T$ over $2T$) and (b) which candidate has the lowest $Q(T)$. but these two criteria do not always agree. so then which candidate do you choose? that is what IP, patents, and trade secrets are made of so i will go no further with that.
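to make the candidate idea concrete, here is a minimal sketch that gathers candidates with a loose threshold and then applies criteria (a) and (b) separately. everything specific is an assumption for illustration: the 44 000 Hz sample rate, the normalization of the squared difference by twice the signal energy, the threshold of 0.1, the lag range, and a test tone of 440 Hz with a 220 Hz component 40 dB down.

```python
import math

FS = 44_000                  # assumed sample rate (440 Hz period = 100 samples)
N = 2000                     # samples compared per lag
MIN_LAG, MAX_LAG = 20, 450   # assumed search range in samples
THRESHOLD = 0.1              # deliberately loose; the value is an assumption

# 440 Hz tone plus a 220 Hz sub-harmonic at -40 dB (amplitude 0.01).
x = [math.sin(2 * math.pi * 440 * i / FS)
     + 0.01 * math.sin(2 * math.pi * 220 * i / FS)
     for i in range(N + MAX_LAG + 2)]

energy = sum(v * v for v in x[:N])

def norm_sq_diff(lag):
    """Squared difference at this lag, normalized by twice the signal energy."""
    return sum((x[i] - x[i + lag]) ** 2 for i in range(N)) / (2 * energy)

d = {lag: norm_sq_diff(lag) for lag in range(MIN_LAG - 1, MAX_LAG + 2)}

# A candidate is a local minimum of d that falls below the loose threshold.
candidates = [(lag, d[lag]) for lag in range(MIN_LAG, MAX_LAG + 1)
              if d[lag] < THRESHOLD and d[lag] < d[lag - 1] and d[lag] < d[lag + 1]]

shortest = min(candidates)[0]                         # criterion (a)
lowest_q = min(candidates, key=lambda c: c[1])[0]     # criterion (b)

for lag, val in candidates:
    print(f"candidate lag {lag} ({FS / lag:.2f} Hz): d = {val:.3e}")
print("shortest period:", shortest, " lowest d:", lowest_q)
```

with this test tone the shortest candidate is lag 100 (440 Hz), but the lowest normalized difference lands on a multiple of 200 samples, so the two criteria disagree.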


problem 2: this has to do with question (2) of the previous answer. Adamski, you almost got it right, but i would say that "but with a very slightly deeper tone" is off-the-mark. what if the 220 Hz added to the 440 Hz is 80 dB down (instead of 60 dB as i originally posed the question)? no one would hear any difference at all. yet, mathematically, it is a periodic function with period 1/220 second and not quite one of period 1/440 second. if you always choose the mathematically truest and bestest period, a little bit of inaudible sub-harmonic will screw you. so, somehow, you need to bias or prefer period candidates that are shorter over the longer periods that may fit better, mathematically. how to do that is also stuff that secret sauce is made of, so i am going no further with that.
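one generic way to express such a bias, offered only as an illustration and not as the method this answer is holding back: accept the shortest candidate whose mismatch is within some tolerance of the best one. the tolerance value, the helper names, and the 44 000 Hz sample rate are all assumptions.

```python
import math

FS = 44_000   # assumed sample rate (440 Hz period = 100 samples)
N = 2000      # samples compared per lag

def norm_sq_diff(x, lag, n=N):
    """Squared difference at this lag, normalized by twice the signal energy."""
    num = sum((x[i] - x[i + lag]) ** 2 for i in range(n))
    den = 2 * sum(x[i] ** 2 for i in range(n))
    return num / den

def pick_shortest_within(candidates, tolerance):
    """Return the shortest lag whose mismatch is within `tolerance` of the best."""
    best = min(val for _, val in candidates)
    return min(lag for lag, val in candidates if val <= best + tolerance)

# 440 Hz tone with an inaudible 220 Hz sub-harmonic, 80 dB down.
sub_gain = 10 ** (-80 / 20)
x = [math.sin(2 * math.pi * 440 * i / FS)
     + sub_gain * math.sin(2 * math.pi * 220 * i / FS)
     for i in range(N + 200)]

candidates = [(lag, norm_sq_diff(x, lag)) for lag in (100, 200)]

strict = min(candidates, key=lambda c: c[1])[0]            # mathematically best fit
biased = pick_shortest_within(candidates, tolerance=1e-3)  # prefers shorter periods

print("strict best fit:", FS / strict, "Hz")
print("biased choice:  ", FS / biased, "Hz")
```

the strict minimum picks the 1/220 second period because of the tiny sub-harmonic, while the tolerance-based pick stays at 1/440 second. how large the tolerance should be, and whether it should depend on the lag, is exactly the tuning question left open above.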


problem 3: jitter or jumping around in pitch selection. usually, when you hear an octave error, it isn't that the pitch of the whole note was judged to be an octave off (high or low) but that, while most of the note had the correct pitch assigned, as the note evolves in time (with changing timbre), some small snippet of that note jumps up or down an octave and, if that drives a synthesizer, will sound quite annoying. so, somehow, you need to put in a little hysteresis and make the pitch you select at earlier times be a little "sticky" and preferred, so that when there is suddenly a single frame that concludes that another candidate an octave high or low is the pitch, you can stick with the pitch you already have chosen for earlier frames. a clean way of doing that is also stuff of which commercial solutions are made, so i won't tell you exactly how to do that.
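a bare-bones sketch of the "sticky" idea, as one possible reading and certainly not a commercial-grade solution: an octave jump is only believed after it has persisted for a few consecutive frames. the function name `sticky_track`, the hold count of 3 frames, and the 0.05-octave tolerance are all assumptions.

```python
import math

def sticky_track(frame_hz, hold_frames=3, tol_octaves=0.05):
    """Smooth per-frame pitch estimates by delaying octave jumps.

    A frame that differs from the current pitch by about one octave is
    ignored until `hold_frames` such frames have arrived in a row.
    Any other change (same note, or a new non-octave note) is accepted
    immediately. Up and down jumps share one counter in this simple sketch.
    """
    out = []
    current = None
    pending = 0   # consecutive frames that are about an octave from `current`
    for f in frame_hz:
        if current is None:
            current = f
        else:
            jump = math.log2(f / current)   # signed distance in octaves
            if abs(abs(jump) - 1.0) < tol_octaves:
                pending += 1
                if pending >= hold_frames:  # the jump persisted: believe it
                    current, pending = f, 0
            else:
                current, pending = f, 0
        out.append(current)
    return out

raw = [440, 440, 880, 440, 441, 220, 440, 880, 880, 880, 880]
print(sticky_track(raw))
```

the isolated 880 and 220 frames are suppressed, while the run of 880 Hz frames at the end is accepted once it has lasted three frames. the price is latency: a real octave leap is reported `hold_frames - 1` frames late.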


so, even though i didn't spell out a solution (i will if you pay me), i have hopefully pointed you in the right direction for you to figure it out with a little creative thinking.


