How to use the DoG pyramid in SIFT
I am very new to image processing and pattern recognition. I am trying to implement the SIFT algorithm. So far I can build the DoG pyramid and identify the local maxima and minima in each octave. What I don't understand is how to use these local maxima/minima from each octave. How do I combine these points?
My question may sound trivial. I have read Lowe's paper, but I could not really understand what he does after building the DoG pyramid.
Any help is appreciated.
Thank you
Comments (3)
Basically, what he does after building the DoG pyramid is detect local extrema in those images. Afterwards, he discards some of the detected extrema because they are likely to be unstable. Identifying those unstable keypoints/features is done in two steps:
1. rejecting keypoints with low contrast
2. rejecting keypoints that are poorly localized along an edge
To be able to do these steps, you first need to get the true location of each extremum by taking a Taylor series expansion of the DoG function around it. This gives you the information needed for both steps.
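For reference, this is the expansion as I read it in Lowe's 2004 paper, where D is the DoG function and x = (x, y, σ)^T is the offset from the detected sample point. H is the 2x2 spatial Hessian of D at the keypoint. The thresholds 0.03 (for pixel values scaled to [0, 1]) and r = 10 are the values reported in that paper and can be tuned. If any component of the offset is larger than 0.5, the extremum lies closer to a neighboring sample point, and the fit is repeated there.

```latex
D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x}
              + \frac{1}{2}\,\mathbf{x}^{T}\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\,\mathbf{x}

\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}
                   \frac{\partial D}{\partial \mathbf{x}}

D(\hat{\mathbf{x}}) = D + \frac{1}{2}\frac{\partial D}{\partial \mathbf{x}}^{T}\hat{\mathbf{x}},
\qquad \text{discard the keypoint if } |D(\hat{\mathbf{x}})| < 0.03

\frac{\operatorname{Tr}(\mathbf{H})^{2}}{\operatorname{Det}(\mathbf{H})} < \frac{(r+1)^{2}}{r},
\qquad r = 10 \quad \text{(keep the keypoint only if this holds)}
```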
The final step is to build the descriptors ...
I'm in the process of studying this algorithm as well, and I don't find it trivial to understand. Some details are not included in Lowe's paper, which makes it harder to follow. I haven't found many extra resources that explain the algorithm in more depth, but there are some open-source implementations, so you could make use of them.
EDIT: more information :)
The paper you linked is his early work; you should get the newest version of the paper because there are some modifications. While searching for more resources I read his patent as well, and it also contains old information, so you shouldn't look there either.
So, my understanding of the scale-space extrema step is as follows. First, we need to build a Gaussian pyramid. The paper says that, for extrema detection to cover a complete octave, we need to build s+3 Gaussian images in each octave. After some tests, Lowe concluded that s = 3 gives the best results. That implies 6 Gaussian images in each octave, from which we get 5 DoG images. Note that all DoG images within one octave have the same resolution. Resampling is done only when passing to the next octave.
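The image counts above can be checked with a small sketch. The function name `octave_layout` is made up for illustration, and the base blur `sigma0 = 1.6` is an assumption taken from the value Lowe reports, not something stated in this thread. The factor between adjacent scales is k = 2^(1/s), so the blur doubles after s steps.

```python
def octave_layout(s=3, sigma0=1.6):
    """Return the blur levels of one octave and the number of DoG images.

    s      -- number of scale intervals per octave (Lowe reports s = 3)
    sigma0 -- assumed base blur of the first image in the octave
    """
    k = 2.0 ** (1.0 / s)            # constant factor between adjacent scales
    num_gaussians = s + 3           # images needed to cover a full octave
    sigmas = [sigma0 * k ** i for i in range(num_gaussians)]
    num_dogs = num_gaussians - 1    # one DoG image per adjacent Gaussian pair
    return sigmas, num_dogs


sigmas, num_dogs = octave_layout()
print(len(sigmas), num_dogs)        # counts of Gaussian and DoG images
```

With s = 3 the fourth entry of `sigmas` has twice the base blur, which is why that image is a natural starting point for the next octave.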
The next step is finding the local extrema. Lowe proposes comparing each sample with its 26 neighbors (8 in the same DoG image and 9 in each of the two adjacent ones), which means we start the search at the second DoG image, because that is the first image for which all 26 neighbors exist. Similarly, we stop the search at the fourth image. This process is repeated for each octave individually. For each extremum found, you should save at least its location and its scale. Once the extrema are found, the next step is more accurate localization, which is done with the Taylor series.
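A minimal sketch of that search, assuming the DoG images of one octave are stored as a list of equally sized 2D lists. The name `find_extrema` and the data layout are illustrative choices, and a real implementation would normally use an array library for speed.

```python
def find_extrema(dog):
    """Scan the DoG stack of one octave for 26-neighborhood extrema.

    dog -- list of 2D lists (all the same size), one per DoG image
    Returns a list of (layer, row, col) tuples.
    """
    found = []
    rows, cols = len(dog[0]), len(dog[0][0])
    # Skip the first/last layer and a 1-pixel border: no full neighborhood there.
    for l in range(1, len(dog) - 1):
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                v = dog[l][r][c]
                neighbors = [
                    dog[l + dl][r + dr][c + dc]
                    for dl in (-1, 0, 1)
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)
                    if (dl, dr, dc) != (0, 0, 0)
                ]
                # Keep the sample only if it is strictly above or below all 26.
                if v > max(neighbors) or v < min(neighbors):
                    found.append((l, r, c))
    return found


# Tiny synthetic stack: 5 DoG images of 5x5 zeros with one peak and one pit.
dog = [[[0.0] * 5 for _ in range(5)] for _ in range(5)]
dog[1][2][2] = 1.0     # maximum in the 2nd DoG image
dog[3][1][3] = -1.0    # minimum in the 4th DoG image
print(find_extrema(dog))
```

The layer index stands in for the scale here; together with the octave number it is enough to recover the blur level of each candidate.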
This is my understanding of how this step works, and I hope I'm not too far from the truth :)
Hope this helps a little more.
vlfeat is an open source library implementing several computer vision algorithms, including SIFT. You should be able to look at that source code to get a better idea of what's being done.
If you're properly finding the extrema in each octave, you can then:
1. refine the scale and location of the extrema
2. discard low-contrast and edge responses
3. For each feature remaining at this point, assign an orientation and compute its descriptor
I'm not sure how much help this has been, because I don't know where you're getting hung up.
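As one concrete piece of the filtering stage, here is a rough sketch of the edge-response check from Lowe's paper, using finite differences on a single DoG image. The function name `passes_edge_test` and the list-of-lists layout are assumptions for illustration; the default ratio of 10 is the value Lowe reports. This is not vlfeat's code.

```python
def passes_edge_test(dog_img, r, c, ratio=10.0):
    """Return True if the point at (r, c) is not edge-like.

    dog_img -- 2D list holding one DoG image
    ratio   -- assumed limit on the ratio of principal curvatures
    """
    # Second derivatives from finite differences of neighboring samples.
    dxx = dog_img[r][c + 1] - 2.0 * dog_img[r][c] + dog_img[r][c - 1]
    dyy = dog_img[r + 1][c] - 2.0 * dog_img[r][c] + dog_img[r - 1][c]
    dxy = (dog_img[r + 1][c + 1] - dog_img[r + 1][c - 1]
           - dog_img[r - 1][c + 1] + dog_img[r - 1][c - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:    # flat direction or curvatures of opposite sign: reject
        return False
    return trace * trace / det < (ratio + 1.0) ** 2 / ratio


blob = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]     # isolated peak
ridge = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]    # horizontal, edge-like ridge
print(passes_edge_test(blob, 1, 1), passes_edge_test(ridge, 1, 1))
```

The isolated peak curves equally in both directions and passes, while the ridge has no curvature along its length and is rejected.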
We have two pyramids: a Gaussian pyramid and a DoG pyramid. The Gaussian pyramid has 6 blurred images. The DoG pyramid holds the differences of adjacent Gaussian images, so there are 5 images in the DoG pyramid.
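In code, going from 6 blurred images to 5 difference images could look like the sketch below. The helper name `build_dog` is hypothetical, and the placeholder images stand in for real blurred images of one octave.

```python
def build_dog(gaussians):
    """Subtract each Gaussian image from the next, more blurred one."""
    return [
        [[b - a for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(img_a, img_b)]
        for img_a, img_b in zip(gaussians, gaussians[1:])
    ]


# Six placeholder 2x2 "blurred" images standing in for one octave.
gaussians = [[[float(i)] * 2 for _ in range(2)] for i in range(6)]
dog = build_dog(gaussians)
print(len(gaussians), len(dog))
```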
For the extrema search you no longer work on the Gaussian pyramid. Note that all of this happens in the first octave! Once the pyramids for the first octave are built, resize your image and start building new pyramids for the second octave.
Let's say your original image is 512x512. In the first octave all images are 512x512, but in the second octave all images are 256x256. Again you have 6 images in the Gaussian pyramid and 5 in the DoG pyramid, but all of them are 256x256 in the second octave. The third octave follows the same pattern.
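The size progression can be sketched as follows. Lowe's paper resamples by taking every second pixel in each row and column of the Gaussian image that has twice the initial blur; the helper names `downsample` and `octave_sizes` are made up for this example.

```python
def downsample(img):
    """Halve the resolution by keeping every second row and column."""
    return [row[::2] for row in img[::2]]


def octave_sizes(height, width, num_octaves):
    """List the image size used in each octave."""
    sizes = []
    for _ in range(num_octaves):
        sizes.append((height, width))
        # Same rounding as slicing with a step of 2.
        height, width = (height + 1) // 2, (width + 1) // 2
    return sizes


print(octave_sizes(512, 512, 3))
small = downsample([[0] * 8 for _ in range(8)])
print(len(small), len(small[0]))
```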
Now for the search for minima and maxima (you are in the first octave):
Let's say you are looking for maxima in the first octave. You must use the DoG pyramid and start from the 2nd image. You take a pixel and check whether it is a maximum. For this check you use the 1st, 2nd and 3rd images of the DoG pyramid. When that is done, find the maxima in the 3rd image by considering the 2nd, 3rd and 4th images. Lastly, find the maxima in the 4th image by considering the 3rd, 4th and 5th images.
Now that finding the minima and maxima in the first octave is complete, go to the next octave and repeat these steps.