Monday 27 June 2016

self study - Unable to understand the derivation of the update equation for LMS


I am trying to follow the derivation of the least mean squares (LMS) filter at https://en.wikipedia.org/wiki/Least_mean_squares_filter#Proof but I cannot arrive at the update rule. I am stuck at the following steps and would appreciate help identifying where I am going wrong.

The transmitter generates a sequence of random input data $\{s_n\}$, each element of which belongs to an alphabet $\mathcal{A}$. The data sequence is sent through the channel, whose output $y_n$ is observed by the receiver. The input/output relationship of the channel can be written as: \begin{align} y_n &= \sum_{k=1}^L h_k s_{n-k} + v_n,\nonumber\\ &= \mathbf{h}^H \mathbf{s}_n + v_n, \tag{1} \end{align} where the additive noise $v_n$ is assumed to be stationary, Gaussian, and independent of the channel input $s_n$. We also assume that the channel is stationary, moving-average, and of finite length. The output of the equalizer is: \begin{align} z_n &= \sum_{k=1}^L {w}_k y_{n-k+1},\nonumber\\ &=\mathbf{{w}}^H \mathbf{y}_n = \hat{s}_n, \tag{2} \end{align} where the superscript $H$ denotes conjugate transpose, $\mathbf{y}_n =[y_n,y_{n-1},\ldots,y_{n-L+1}]^T$ is the signal to be equalized, $z_n$ is the output of the equalizer, and $\mathbf{{w}} = [{w}_1,{w}_2,\ldots,{w}_L]^T$ is the vector of equalizer coefficients.
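
To make the setup concrete, here is a small numerical sketch of Eqs (1) and (2); the BPSK-like alphabet, the channel taps, and all variable names are only illustrative assumptions on my part, with $\mathbf{s}_n = [s_{n-1},\ldots,s_{n-L}]^T$:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 4                                                  # channel / equalizer length (illustrative)
h = rng.standard_normal(L) + 1j * rng.standard_normal(L)   # channel taps h_k (assumed)
w = rng.standard_normal(L) + 1j * rng.standard_normal(L)   # equalizer taps w_k (arbitrary here)

num = 1000
s = rng.choice([-1.0, 1.0], size=num) + 0j             # symbols s_n from a BPSK-like alphabet A
v = 0.05 * (rng.standard_normal(num) + 1j * rng.standard_normal(num))   # additive noise v_n

y = np.zeros(num, dtype=complex)                       # channel output, Eq (1)
z = np.zeros(num, dtype=complex)                       # equalizer output, Eq (2)
for n in range(L, num):
    s_vec = s[n - L:n][::-1]                           # s_n = [s_{n-1}, ..., s_{n-L}]^T
    y[n] = np.conj(h) @ s_vec + v[n]                   # y_n = h^H s_n + v_n
for n in range(L, num):
    y_vec = y[n - L + 1:n + 1][::-1]                   # y_n = [y_n, y_{n-1}, ..., y_{n-L+1}]^T
    z[n] = np.conj(w) @ y_vec                          # z_n = w^H y_n = estimate of s_n
```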


The cost function for LMS is $J = E[|e_n|^2]$, which is minimized. \begin{align} e_n &= s_n - \hat{s}_n,\nonumber\\ &=s_n - \mathbf{{w}}^H \mathbf{y}_n. \tag{3} \end{align} We need to take the gradient of the cost function and then equate it to zero to obtain the equation for the equalizer coefficients. The gradient is \begin{align} \nabla J(\mathbf{w}) &= \nabla E[e_ne^*_n],\nonumber\\ &=2E[\nabla(e_n)e^*_n]. \tag{4} \end{align} Now, $\nabla(e_n) = \nabla(s_n - \mathbf{{w}}^H \mathbf{y}_n) = -\mathbf{y}_n$. Substituting this into Eq(4) and equating to zero we get \begin{align} \nabla J(\mathbf{w}) &= \nabla E[e_ne^*_n],\nonumber\\ &=2E[-\mathbf{y}_ne^*_n]. \tag{5} \end{align} Using Eq(3), \begin{align} \nabla J(\mathbf{w}) &=0,\nonumber\\ 2E[-\mathbf{y}_ne^*_n] &=0,\nonumber\\ \mathbf{y}_ne^*_n &= 0,\nonumber\\ \mathbf{y}_n(s_n^* - \mathbf{y}_n^H \mathbf{{w}}) &=0,\nonumber\\ \mathbf{{w}} &= {(\mathbf{y}_n \mathbf{y}_n^H)}^{-1}(\mathbf{y}_n s_n^*). \tag{6} \end{align}
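
(If the expectation is kept rather than dropped in the third line above, the same steps give the Wiener-style block solution

$$ \mathbf{w} = \big(E[\mathbf{y}_n \mathbf{y}_n^H]\big)^{-1} E[\mathbf{y}_n s_n^*], $$

which is still a closed-form solution rather than an iterative update.)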



I cannot understand where the step size $\mu$ enters. The result in Eq(6) is completely different from the update equation: $\mathbf{{w}}_{n+1}=\mathbf{{w}}_n + \mu\, e_{n}^*\mathbf{y}_{n}$.



Answer



so with an LMS filter, we have a time-variant $N$-tap FIR filter:


$$ y[n] = \sum\limits_{k=0}^{N-1} h_n[k] \, x[n-k] $$


$x[n]$ is the input signal, $y[n]$ is the FIR output, and $h_n[k]$ are the FIR tap coefficients at the time of sample $n$.
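
as a tiny sketch of that sum in code (the function and argument names are just illustrative, not from any library):

```python
def fir_output(h_n, x, n):
    # y[n] = sum_{k=0}^{N-1} h_n[k] * x[n-k]
    # h_n: the N tap coefficients at time n; x: input samples; assumes n >= len(h_n) - 1
    return sum(h_n[k] * x[n - k] for k in range(len(h_n)))
```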


with an LMS filter, we also have another input called the desired signal: $d[n]$. we want our LMS filter to adapt so that the output $y[n]$ follows the desired signal $d[n]$. from those two signals, we define an LMS error signal:


$$ e[n] \triangleq y[n] - d[n] $$


and this Least Mean Square adaptive filter will adjust its tap coefficients, $h_n[k]$, in such a manner as to eventually minimize the square of this error signal $e[n]$.


so the error signal is:


$$\begin{align} e[n] &= y[n] - d[n] \\ &= \sum\limits_{k=0}^{N-1} h_n[k] \, x[n-k] - d[n] \\ \end{align}$$



and the square of this error signal (which we want to lessen in an LMS filter) is:


$$ \big(e[n]\big)^2 = \left( \sum\limits_{k=0}^{N-1} h_n[k] \, x[n-k] - d[n] \right)^2 $$


now, one thing Bernard Widrow must have been thinking is that, since the inputs $x[n]$ and $d[n]$ are given, the only controls we can adjust to lessen this square error $e^2[n]$ are the $N$ tap coefficients $h_n[k]$ for $0 \le k \le N-1$. with $x[n]$ and $d[n]$ fixed, then $e^2[n]$ is a function, $f(\cdot)$, of the $N$ tap coefficients $h_n[k]$. so at time $n$ this function of $N$ variables looks like


$$\begin{align} e^2[n] & \triangleq f\big(h_n[0], h_n[1], h_n[2], ... h_n[N-1] \big) \\ &= \left( \sum\limits_{k=0}^{N-1} h_n[k] \, x[n-k] - d[n] \right)^2 \\ \end{align}$$


at time $n$, the derivative of $f(\cdot)$ with respect to the $m$-th tap coefficient is


$$\begin{align} \frac{\partial}{\partial h_n[m]} \, e^2[n] &= \frac{\partial}{\partial h_n[m]} \, f\big(h_n[0], h_n[1], h_n[2], ... h_n[N-1] \big) \\ &= \frac{\partial}{\partial h_n[m]} \, \left( \sum\limits_{k=0}^{N-1} h_n[k] \, x[n-k] - d[n] \right)^2 \\ &= 2 \, \left( \sum\limits_{k=0}^{N-1} h_n[k] \, x[n-k] - d[n] \right) \cdot x[n-m] \\ &= 2 \, e[n] \cdot x[n-m] \\ \end{align}$$


now from this we can set all of these derivatives to zero and set up $N$ equations and $N$ unknowns and try to solve the mess. i am sure Widrow did this. but it doesn't lead to a very cheap (in terms of computational cost) algorithm.


but for each $m$, where $0 \le m \le N-1$, the quantity


$$ \frac{\partial}{\partial h_n[m]} \, f\big(h_n[0], h_n[1], h_n[2], ... h_n[N-1] \big) = 2 e[n] x[n-m] $$


is one component of the vector which is the gradient of $f(\cdot)$.
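
here is a quick numerical sanity check of that gradient component against a central finite difference, using a made-up random setup (none of these names come from anywhere in particular):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 4                                  # number of taps
n = 10                                 # the time index at which we check the gradient
x = rng.standard_normal(n + 1)         # input samples x[0..n]
d = rng.standard_normal(n + 1)         # desired signal d[0..n]
h = rng.standard_normal(N)             # tap coefficients h_n[k] at time n

def sq_err(taps):
    # f(h_n[0], ..., h_n[N-1]) = (sum_k h_n[k] x[n-k] - d[n])^2
    y_n = sum(taps[k] * x[n - k] for k in range(N))
    return (y_n - d[n]) ** 2

e_n = sum(h[k] * x[n - k] for k in range(N)) - d[n]    # e[n] = y[n] - d[n]

for m in range(N):
    analytic = 2.0 * e_n * x[n - m]                    # the formula above: 2 e[n] x[n-m]
    eps = 1e-6
    h_plus, h_minus = h.copy(), h.copy()
    h_plus[m] += eps
    h_minus[m] -= eps
    numeric = (sq_err(h_plus) - sq_err(h_minus)) / (2 * eps)   # central finite difference
    print(m, analytic, numeric)                        # the two values should agree closely
```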



for the moment, try to visualize this as a function of two variables: $f\big(h_n[0], h_n[1]\big)$. this is a surface above the point in a plane with coordinates $(h_n[0], h_n[1])$. the gradient


$$\begin{align} \left(\tfrac{\partial f(\cdot)}{\partial h_n[0]}, \ \tfrac{\partial f(\cdot)}{\partial h_n[1]}\right) &= \Big(2 e[n] x[n-0], \ 2 e[n] x[n-1] \Big) \\ &= 2 \, e[n] \Big(x[n], \ x[n-1] \Big) \\ \end{align} $$


is the 2-dim vector that points in the direction of the steepest increasing slope of this surface. so if $\tfrac{\mu}{2}$ is a tiny positive number, moving a tiny bit in the direction of the gradient should increase this function a tiny bit.


$$ f\big(h_n[0], \ h_n[1]\big) < f\big(h_n[0] + 2 e[n] x[n] \tfrac{\mu}{2}, \ h_n[1] + 2 e[n] x[n-1] \tfrac{\mu}{2} \big) $$


but $f(\cdot)$ is the square of error and we want to decrease it, so we move in the opposite direction of the gradient:


$$ f\big(h_n[0], \ h_n[1]\big) > f\big(h_n[0] - 2 e[n] x[n] \tfrac{\mu}{2}, \ h_n[1] - 2 e[n] x[n-1] \tfrac{\mu}{2} \big) $$


that's with two taps ($N=2$). extending the idea of this "surface" to more dimensions than two, we expect


$$ f\big(h_n[0]-e[n]x[n]\mu, ... ,h_n[k]-e[n]x[n-k]\mu, ... \big) < f\big(h_n[0], ... ,h_n[k], ... \big) $$


so, for a small stepsize $\mu$, the adaptation algorithm (or coefficient "update equation") is


$$ h_{n+1}[k] \ \leftarrow \ h_n[k] \ - \ (\mu \, e[n]) \, x[n-k] $$



that is the simple unnormalized LMS adaptive filter. you want $\mu$, also known as the adaptation gain, to be large enough so that the LMS filter adapts at an acceptably quick rate, but small enough so that it doesn't overshoot in the adaptation and become unstable.
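
here is a minimal sketch of that update as a loop in code; the toy system-identification setup and all of the names (`h_true`, `mu`, and so on) are my own assumptions, not anything from a particular library:

```python
import numpy as np

rng = np.random.default_rng(2)

N = 8                                  # number of taps
mu = 0.05                              # adaptation gain, small enough for stability here
num_samples = 5000

x = rng.standard_normal(num_samples)               # input signal
h_true = rng.standard_normal(N)                    # toy unknown system to identify (assumed)
d = np.convolve(x, h_true)[:num_samples]           # desired signal = that system's output

h = np.zeros(N)                                    # adaptive tap coefficients h_n[k]
for n in range(N - 1, num_samples):
    x_vec = x[n - N + 1:n + 1][::-1]               # [x[n], x[n-1], ..., x[n-N+1]]
    y_n = h @ x_vec                                # FIR output y[n]
    e_n = y_n - d[n]                               # e[n] = y[n] - d[n]
    h = h - mu * e_n * x_vec                       # h_{n+1}[k] = h_n[k] - mu e[n] x[n-k]

print(np.max(np.abs(h - h_true)))                  # should approach zero as the filter adapts
```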


because the stepsize is really $(\mu \, e[n]) x[n-k]$ and both $e[n]$ and $x[n-k]$ are expected to be proportional in magnitude to the mean of $\big|x[n]\big|$, this means louder signals will cause the adaptation to happen faster than quieter signals. so, from this, the Normalized LMS adaptive filter is defined to divide that term by some measure of the mean of $\big|x[n]\big|^2$.


so the Normalized LMS adaptation algorithm (or coefficient update) is


$$ h_{n+1}[k] \ \leftarrow \ h_n[k] \ - \ \frac{\mu \, e[n]}{\overline{x^2[n]} + \epsilon} \, x[n-k] $$


where $\overline{x^2[n]}$ is some measure of the mean of $x^2[n]$ over the samples most recent to time $n$. you can simply square $x[n]$ and run that through a simple first-order IIR filter to get a measure of the mean:


$$ \overline{x^2[n]} \ \leftarrow \ p \, \overline{x^2[n-1]} \ + \ (1-p) x^2[n] $$


where $p$ is the pole of the one-pole IIR filter and $0 < (1-p) \ll 1$. the quantity $\epsilon > 0$ is a tiny amount to add to the denominator so that we never divide by zero, even if the input $x[n]$ goes to zero.
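
and here is a matching sketch of the normalized version, including the one-pole smoother for $\overline{x^2[n]}$; again the toy setup and names are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

N = 8                                  # number of taps
mu = 0.05                              # normalized stepsize; kept well below 2/N because the
                                       # denominator below is per-sample power, not ||x_vec||^2
p = 0.99                               # pole of the one-pole smoother, 0 < (1 - p) << 1
eps = 1e-8                             # tiny amount so we never divide by zero
num_samples = 5000

x = 10.0 * rng.standard_normal(num_samples)        # a deliberately "loud" input
h_true = rng.standard_normal(N)                    # toy unknown system to identify (assumed)
d = np.convolve(x, h_true)[:num_samples]           # desired signal = that system's output

h = np.zeros(N)                                    # adaptive tap coefficients h_n[k]
x2_bar = np.mean(x[:100] ** 2)                     # warm-start the power estimate so the first
                                                   # updates are not oversized
for n in range(N - 1, num_samples):
    x2_bar = p * x2_bar + (1 - p) * x[n] ** 2      # one-pole IIR smoothing of x^2[n]
    x_vec = x[n - N + 1:n + 1][::-1]               # [x[n], x[n-1], ..., x[n-N+1]]
    e_n = h @ x_vec - d[n]                         # e[n] = y[n] - d[n]
    h = h - (mu * e_n / (x2_bar + eps)) * x_vec    # normalized LMS update

print(np.max(np.abs(h - h_true)))                  # should be small despite the loud input
```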

