Wednesday 13 April 2016

Real-time YIN pitch detection algorithm (how do I improve my results?)


I am using the YIN algorithm in a school project that performs pitch detection on guitar sound. When I play a note, I get random frequencies at the beginning until the estimates stabilize. I think those probably come from the action of the pick on the strings.


I am going through the original paper:



de Cheveigné A., Kawahara H. - YIN, a fundamental frequency estimator for speech and music



trying to reverse engineer the library and improve my results. I am studying computer science, and my knowledge of signal processing is limited. Here is my summary of the algorithm so far:


Step 1) Autocorrelation: We compute the correlation of the signal with itself shifted by a lag $\tau$, within a window.





The function can in principle be evaluated over an unbounded range of lags. We choose the highest peak at a non-zero lag, within a given lag range. (Why the highest peak? Does it mean the loudest frequency?) The paper says that if the upper limit for $\tau$ is too large, the algorithm may choose a higher-order peak. (What are higher-order peaks?)
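
As an illustration of Step 1, here is a minimal sketch of a windowed autocorrelation, $r(\tau) = \sum_j x_j x_{j+\tau}$ over a fixed window of $W$ samples. The function name, the test tone and the values of `W` and `max_tau` are assumptions made for the example, not taken from any particular library. For a periodic signal the peaks at multiples of the period (here lags 100, 200, 300, ...) are essentially as high as the peak at the period itself. These are the higher-order peaks, and a wide lag range gives the peak picker more of them to lock onto.

```python
import numpy as np

def autocorrelation(x, W, max_tau):
    """r(tau) = sum_{j=0}^{W-1} x[j] * x[j + tau] for tau in [0, max_tau)."""
    x = np.asarray(x, dtype=float)
    if len(x) < W + max_tau:
        raise ValueError("need at least W + max_tau samples")
    return np.array([np.dot(x[:W], x[tau:tau + W]) for tau in range(max_tau)])

# Hypothetical test tone: a sine with a period of exactly 100 samples
# (441 Hz at a sample rate of 44100).
x = np.sin(2 * np.pi * np.arange(2048) / 100)
r = autocorrelation(x, W=1024, max_tau=512)

# Peaks at the period and at its multiples have almost the same height,
# while half a period gives a strongly negative value.
print(r[0], r[100], r[200], r[50])
```

Because the peak heights are nearly equal, which one is numerically "highest" can depend on rounding, which is one reason the later steps of the algorithm exist.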


The following steps are meant to improve the accuracy of the result:


Step 2) Difference function: Model the signal in the form of a difference function, $$ d_t(\tau) = \sum_{j=1}^{W} (x_j - x_{j+\tau})^2 $$ which, after expanding the square, can be written in terms of the autocorrelation: $$ d_t(\tau) = r_t(0) + r_{t+\tau}(0) - 2r_t(\tau) $$
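
The difference function of Step 2 can be sketched directly from its definition. This is a plain O(W · max_tau) loop for clarity, not an optimized version, and the function name and test tone are assumptions made for the example.

```python
import numpy as np

def difference_function(x, W, max_tau):
    """d(tau) = sum_{j=0}^{W-1} (x[j] - x[j + tau])**2 for tau in [0, max_tau)."""
    x = np.asarray(x, dtype=float)
    if len(x) < W + max_tau:
        raise ValueError("need at least W + max_tau samples")
    d = np.empty(max_tau)
    for tau in range(max_tau):
        diff = x[:W] - x[tau:tau + W]
        d[tau] = np.dot(diff, diff)  # sum of squared differences
    return d

# Hypothetical test tone with a period of exactly 100 samples.
x = np.sin(2 * np.pi * np.arange(2048) / 100)
d = difference_function(x, W=1024, max_tau=512)

# d is zero at lag 0 and dips to (almost) zero again at the period.
print(d[0], d[100], d[50])
```

Where the autocorrelation has peaks, the difference function has dips, so the period is now searched for as a minimum rather than a maximum.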


So, as I understand it, the energy terms $r_t(0)$ and $r_{t+\tau}(0)$ act as an amplitude-dependent bias.


Step 3) Cumulative mean normalization: Replace the difference function by the cumulative mean normalized difference function, to avoid selecting the dip at zero lag:


$$d'_t(\tau)=\begin{cases}1, & \tau = 0\\[4pt] \dfrac{d_t(\tau)}{(1/\tau)\sum_{j=1}^{\tau}d_t(j)}, & \text{otherwise}\end{cases}$$
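
A minimal sketch of the normalization in Step 3, assuming the difference function is already available as an array `d` indexed by lag. The function name and the handling of an all-zero (silent) frame are assumptions made for the example.

```python
import numpy as np

def cumulative_mean_normalized_difference(d):
    """d'(0) = 1 and d'(tau) = d(tau) / ((1/tau) * sum_{j=1}^{tau} d(j))."""
    d = np.asarray(d, dtype=float)
    out = np.ones(len(d))
    taus = np.arange(1, len(d))
    running_sum = np.cumsum(d[1:])
    # Where the running sum is zero (e.g. digital silence), keep the value 1.
    np.divide(d[1:] * taus, running_sum, out=out[1:], where=running_sum > 0)
    return out

# Small hand-checkable example.
d_prime = cumulative_mean_normalized_difference([0.0, 4.0, 2.0, 6.0])
print(d_prime)
```

The result starts at 1 instead of 0, and stays near or above 1 at small lags, so the zero-lag dip can no longer win the search for a minimum.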


Step 4) Absolute threshold (could anyone explain this section?)
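
For what it is worth, the paper describes this step as choosing the smallest lag at which the normalized difference function has a minimum deeper than a fixed threshold, and falling back to the global minimum if no dip goes below the threshold. The sketch below is one possible reading of that rule. The function name and the `tau_min` parameter are assumptions made for the example, and the default of 0.1 simply mirrors the THRESHOLD value listed in the question.

```python
import numpy as np

def absolute_threshold(d_prime, threshold=0.1, tau_min=2):
    """Return the first lag whose dip in d_prime falls below `threshold`.

    Falls back to the global minimum (over lags >= tau_min) if no value
    is below the threshold.
    """
    d_prime = np.asarray(d_prime, dtype=float)
    tau = tau_min
    while tau < len(d_prime):
        if d_prime[tau] < threshold:
            # Walk down to the bottom of this dip (the local minimum).
            while tau + 1 < len(d_prime) and d_prime[tau + 1] < d_prime[tau]:
                tau += 1
            return tau
        tau += 1
    return tau_min + int(np.argmin(d_prime[tau_min:]))

# The dip at lag 5 is chosen even though lag 8 is the deeper, global minimum.
example = [1.0, 1.0, 0.8, 0.5, 0.09, 0.05, 0.2, 0.9, 0.01, 0.5]
chosen = absolute_threshold(example, threshold=0.1)
print(chosen)
```

Preferring the first sufficiently deep dip over the deepest one is what protects against picking a multiple of the period, which would show up as an estimate one or more octaves too low.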


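Step 5) Parabolic interpolation: Fit a parabola through each local minimum of $d'(\tau)$ and its two immediate neighbors, and use the position of the parabola's minimum as a refined, non-integer lag estimate.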
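
A minimal sketch of three-point parabolic interpolation around an integer lag. The function name is an assumption made for the example, and edge lags are returned unchanged.

```python
def parabolic_interpolation(d, tau):
    """Refine an integer lag using the parabola through d[tau-1], d[tau], d[tau+1]."""
    if tau <= 0 or tau >= len(d) - 1:
        return float(tau)  # no neighbor on one side, nothing to fit
    a, b, c = d[tau - 1], d[tau], d[tau + 1]
    denom = a - 2.0 * b + c
    if denom == 0:
        return float(tau)  # the three points are collinear
    return tau + 0.5 * (a - c) / denom

# Samples of the parabola (x - 10.3)**2, whose true minimum lies at 10.3.
samples = [(n - 10.3) ** 2 for n in range(20)]
refined = parabolic_interpolation(samples, 10)
print(refined)

# The frequency estimate is then sample_rate / refined_lag.
sample_rate = 44100
print(sample_rate / refined)
```

Without this step the lag is quantized to whole samples, which at 44100 Hz produces noticeable frequency steps for high notes, where the period is only a few dozen samples long.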


Step 6) Choose the best local estimate: Self-explanatory.



I am trying to compare the guitar sound with a monophonic MIDI track.


I think the parameters I should be tinkering with to improve my results are the window size and the threshold. Alternatively, I could discard the first few frames. Could anyone point me in the right direction?


The parameters I am using :


SAMPLE RATE : 44100


WINDOW SIZE : 1024


HOP SIZE : 512


THRESHOLD : 0.1



Answer



If you are not doing this in low-latency real-time, you can work backwards from the stable portion of the pitch estimate to the transient attack portion of the waveform at the beginning.


The sound of a plucked guitar string evolves in a possibly predictable pattern over time (e.g. more so than voice). If you can estimate the onset time and/or have neighboring pitch estimates, you can adaptively set window sizes and threshold levels over time to more optimal values, as potentially determined by experimentation on some data set of guitar notes. You can also use statistical decision theory to determine if any local pitch estimate fits the history of a reasonable spectral evolution of any guitar note, and reject outliers as noise, transients, harmonics or octave errors (harmonic and octave errors potentially being correctable errors). This is especially useful working backwards in time, as the attack is usually noisier than the sustain/decay portion of a note's evolution.
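
As a very rough illustration of working backwards from the stable portion, the sketch below walks a per-frame pitch track from its last frame towards the attack and replaces any estimate that jumps too far (measured in cents) from the last accepted one. It is a simple heuristic standing in for the statistical decision rule described above, not an implementation of it. It assumes a single note per track, strictly positive estimates and a trustworthy final frame. The function name, the 50-cent tolerance and the example values are all made up for the example.

```python
import numpy as np

def track_backwards(f0_track, max_cents=50.0):
    """Clean a single-note pitch track by walking from the last frame to the first."""
    f0 = np.asarray(f0_track, dtype=float)
    cleaned = f0.copy()
    ref = f0[-1]  # assumed to lie in the stable sustain/decay portion
    for i in range(len(f0) - 2, -1, -1):
        cents = 1200.0 * abs(np.log2(f0[i] / ref))
        if cents > max_cents:
            cleaned[i] = ref  # reject as transient, noise or octave error
        else:
            ref = f0[i]       # accept and follow slow drift
    return cleaned

# Hypothetical track (Hz): noisy attack, an octave error, then a stable low E.
track = [937.2, 412.0, 164.9, 82.5, 82.4, 82.4]
cleaned = track_backwards(track)
print(cleaned)
```

An octave error such as the 164.9 Hz frame could be corrected (halved) rather than replaced, since it is off by a known factor.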



Some improvement in steps 4) and 6) can also be acquired from psycho-acoustic experimentation with human listeners. For instance: For an octave difference to be perceived above a certain error rate by typical human listeners, how much normalized difference in octave pitch estimate peaks over what amount of time is required? Any difference over a smaller amount of time might be imperceptible.


ADDED: A window size of 1024 at a sample rate of 44100 spans only about 23 ms, which is slightly fewer than 2 periods of the pitch of the lowest string of a guitar (E2 = 82.41 Hz). A 3X or 4X longer window might be more reliable for the lowest guitar notes, but shorter windows would be more responsive, or provide better time locality, for the higher fretted notes.
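
The window-length argument can be checked with a few lines of arithmetic. The helper name is made up for the example, and 82.41 Hz is the standard-tuning frequency of the low E string (E2).

```python
def periods_in_window(window_size, sample_rate, f0):
    """Number of periods of a tone at f0 (Hz) that fit in window_size samples."""
    return window_size * f0 / sample_rate

SAMPLE_RATE = 44100
LOW_E_HZ = 82.41  # open low E string of a guitar in standard tuning

for window_size in (1024, 2048, 4096):
    n = periods_in_window(window_size, SAMPLE_RATE, LOW_E_HZ)
    print(f"{window_size} samples: {n:.2f} periods of the low E")
```

With 1024 samples this works out to roughly two periods, so the lag search has very little signal left to compare against at the lowest notes. With 4096 samples it is between seven and eight.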

