In scale-space theory, the scale-space representation of a signal $f(x)$, $x = (x_1, \ldots, x_d)$ (for an image, $d = 2$), is given by $L(x, y; t) = g(x, y; t) * f(x, y)$, where $g(x, y; t)$ is a Gaussian kernel with parameter $t$ and $*$ denotes convolution. By changing the parameter $t$ we obtain a more or less smoothed image. As a result, a coarser representation (larger $t$) will not contain small objects or noise.
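For concreteness, here is how I compute this representation (a minimal sketch assuming NumPy/SciPy; `scale_space` is just my name for the helper):

```python
import numpy as np
from scipy import ndimage

# g(x, y; t) is a Gaussian with variance t, so its standard deviation
# is sqrt(t); gaussian_filter performs the convolution g * f.
def scale_space(f, t):
    return ndimage.gaussian_filter(f, sigma=np.sqrt(t))

image = np.random.rand(128, 128)      # stand-in for a real image
fine = scale_space(image, t=1.0)      # small t: little smoothing
coarse = scale_space(image, t=16.0)   # large t: small structures vanish
```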
The main point is to find a way of doing scale-invariant feature detection, right? So that for some image and a resized copy of it, features like keypoints are detected consistently even though the size differs, without picking up spurious noise keypoints.
In the paper they use $\gamma$-normalized derivatives: $\partial_{\xi,\,\gamma\text{-norm}} = t^{\gamma/2}\, \partial_x$. What is the meaning of using the $\gamma$-normalized derivative, and how does it help with scale invariance?
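As far as I can tell, in code this would look something like the following (a sketch continuing the NumPy/SciPy setup above; the helper name is mine):

```python
import numpy as np
from scipy import ndimage

# gamma-normalized first derivative in x at scale t: the plain derivative
# of L shrinks in magnitude as t grows, and the factor t^(gamma/2)
# compensates for that.
def gamma_norm_dx(f, t, gamma=1.0):
    L = ndimage.gaussian_filter(f, sigma=np.sqrt(t))
    Lx = np.gradient(L, axis=1)       # discrete dL/dx
    return t ** (gamma / 2.0) * Lx
```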
From this image we can see that keypoints of different sizes are found at nearly the same positions. How is that possible?
If you could explain the step-by-step algorithm of scale-invariant feature detection, that would be great. What is actually done? The derivatives can be taken with respect to $x$, $y$, or $t$. A blob can be detected by taking the derivative of $L$ with respect to the $(x, y)$ variables. How does the derivative with respect to $t$ help here?
The paper I was reading is: Feature detection with automatic scale selection
Answer
It really has been a long time since I read Lindeberg's papers, so the notation looks a bit strange. As a result, my initial answer was wrong: $\gamma$ is not a scale level. It seems to be a tunable parameter. It is true that you need to multiply the derivative by the appropriate power of $t$; $t$ itself corresponds to a scale level, and the power depends on the order of the derivative.
You can find keypoints at multiple scales at the same location. That is because you look for local maxima over scales. Here's the intuition: think of an image of a face. At a fine scale you get a blob corresponding to the nose; at a coarse scale you get a blob corresponding to the whole face. The two blobs are centered at the same point but have different scales.
Here is the whole algorithm (a code sketch follows the list):
- Decide which image features you are interested in (e.g. blobs, corners, edges).
- Define a corresponding "detector function" in terms of derivatives, e.g. a Laplacian for blobs.
- Compute derivatives that you need for your detector function at a range of scales.
- Multiply the derivative responses by $t^{m \gamma / 2}$, where $m$ is the order of the derivative, to compensate for magnitude decrease.
- Compute the detector function over the whole scale space.
- Find local maxima of the detector function over $x, y, t$.
- These are your interest points, or keypoints.
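A minimal sketch of this pipeline for blob detection with a scale-normalized Laplacian (assuming NumPy/SciPy; the function names, threshold, and scale sampling are mine, not from the paper):

```python
import numpy as np
from scipy import ndimage

def detect_blobs(image, t_values, gamma=1.0, threshold=0.1):
    # Detector responses over the whole scale space. The Laplacian is a
    # second-order derivative (m = 2), so the normalization factor
    # t^(m * gamma / 2) becomes t^gamma.
    responses = []
    for t in t_values:
        L = ndimage.gaussian_filter(image, sigma=np.sqrt(t))
        responses.append(t ** gamma * np.abs(ndimage.laplace(L)))
    cube = np.stack(responses)                    # shape: (scale, y, x)

    # Local maxima over (t, y, x): keep a point if it equals the maximum
    # of its 3x3x3 neighborhood and exceeds the threshold.
    is_max = (ndimage.maximum_filter(cube, size=3) == cube) & (cube > threshold)
    return [(x, y, t_values[s]) for s, y, x in np.argwhere(is_max)]
```

Each returned tuple is a keypoint position plus the scale $t$ at which it fired, and that scale doubles as a size estimate for the blob.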
Edit:
- Lindeberg proves in the paper that $t^{\gamma / 2}$ is the appropriate factor for normalization of derivatives. I don't think I can reproduce the proof here.
- You do not take derivatives with respect to $t$. You only compute derivatives with respect to $x$ and $y$, but you compute them at a range of scales. One way to think about this is that you first generate a Gaussian scale space by repeatedly blurring the image with a Gaussian filter of some variance $t$, and then compute derivatives with respect to $x$ and $y$ at each scale level.
- You want to find local maxima over scales because you may have image features of different size at the same location. Think of an image of concentric circles, like a bulls-eye. It will give you high responses of a Laplacian at several scales. Or think of an image of a real human eye filtered by a Laplacian at a range of scales. You will get a high response at a fine scale for the pupil, a high response at some medium scale for the iris, and a high response at a coarse scale for the whole eye.
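To see this numerically, here is a small experiment (same NumPy/SciPy assumptions; the synthetic image and scale samples are made up for illustration): two Gaussian blobs of different widths share one center, and the scale-normalized Laplacian response sampled at that center peaks twice as $t$ grows.

```python
import numpy as np
from scipy import ndimage

# Two Gaussian blobs (sigma = 2 and sigma = 20) at the same center.
yy, xx = np.mgrid[0:257, 0:257]
r2 = (yy - 128.0) ** 2 + (xx - 128.0) ** 2
image = np.exp(-r2 / (2 * 2.0 ** 2)) + np.exp(-r2 / (2 * 20.0 ** 2))

for t in [2, 4, 8, 32, 128, 400, 800]:
    L = ndimage.gaussian_filter(image, sigma=np.sqrt(t))
    response = t * abs(ndimage.laplace(L)[128, 128])  # gamma = 1, m = 2
    print(t, response)
# The response should rise and fall twice: one local maximum near t = 4
# (the small blob, t ~ sigma^2 = 4) and another near t = 400 (the large
# blob, t ~ sigma^2 = 400).
```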
The whole point is that you do not know ahead of time at what scale the features of interest might be, so you look at all scales.