What is the fastest way to compare two equal-size bitmaps to determine whether they are identical?
I am trying to write a function to determine whether two equal-size bitmaps are identical or not. The function I have right now simply compares a pixel at a time in each bitmap, returning false at the first non-equal pixel.
While this works, and works well for small bitmaps, in production I'm going to be using this in a tight loop and on larger images, so I need a better way. Does anyone have any recommendations?
The language I'm using is C# by the way - and yes, I am already using the .LockBits method. =)
Edit: I've coded up implementations of some of the suggestions given, and here are the benchmarks. The setup: two identical (worst-case) bitmaps, 100x100 in size, with 10,000 iterations each. Here are the results:
CompareByInts (Marc Gravell) : 1107ms
CompareByMD5 (Skilldrick) : 4222ms
CompareByMask (GrayWizardX) : 949ms
In CompareByInts and CompareByMask I'm using pointers to access the memory directly; in the MD5 method I'm using Marshal.Copy to retrieve a byte array and pass that as an argument to MD5.ComputeHash. CompareByMask is only slightly faster, but given the context I think any improvement is useful.
Thanks everyone. =)
Edit 2: Forgot to turn optimizations on - doing that gives GrayWizardX's answer even more of a boost:
CompareByInts (Marc Gravell) : 944ms
CompareByMD5 (Skilldrick) : 4275ms
CompareByMask (GrayWizardX) : 630ms
CompareByMemCmp (Erik) : 105ms
Interesting that the MD5 method didn't improve at all.
Edit 3: Posted my answer (MemCmp) which blew the other methods out of the water. o.O
Edit 8-31-12: per Joey's comment below, be mindful of the format of the bitmaps you compare. They may contain padding on the strides that render the bitmaps unequal, despite being equivalent pixel-wise. See this question for more details.
Reading this answer to a question regarding comparing byte arrays has yielded a MUCH FASTER method: using P/Invoke and the memcmp API call in msvcrt. Here's the code:
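The answer's code block is not reproduced here. As a rough sketch of the idea only (not the original code), something along these lines should work, assuming Windows (msvcrt.dll is available), a reference to System.Drawing, and that locking both bitmaps as 32bpp ARGB is acceptable. The class name BitmapMemCmp is made up for illustration. Comparing row by row over Width * 4 bytes means stride padding never enters the comparison:

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

public static class BitmapMemCmp
{
    // C runtime memcmp via P/Invoke; msvcrt.dll only exists on Windows.
    [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern int memcmp(IntPtr b1, IntPtr b2, UIntPtr count);

    public static bool CompareByMemCmp(Bitmap a, Bitmap b)
    {
        if (ReferenceEquals(a, b)) return true;    // same object, or both null
        if (a == null || b == null) return false;
        if (a.Size != b.Size) return false;

        var rect = new Rectangle(Point.Empty, a.Size);
        // Assumption: 32bpp ARGB, so every row holds exactly Width * 4 pixel bytes.
        BitmapData da = a.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
        try
        {
            BitmapData db = b.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
            try
            {
                var rowBytes = (UIntPtr)(uint)(da.Width * 4);
                for (int y = 0; y < da.Height; y++)
                {
                    // Step by Stride, but compare only the pixel bytes of each row.
                    IntPtr rowA = IntPtr.Add(da.Scan0, y * da.Stride);
                    IntPtr rowB = IntPtr.Add(db.Scan0, y * db.Stride);
                    if (memcmp(rowA, rowB, rowBytes) != 0) return false;
                }
                return true;
            }
            finally { b.UnlockBits(db); }
        }
        finally { a.UnlockBits(da); }
    }
}
```

A single memcmp over Stride * Height bytes is also possible and saves the loop, but then any padding bytes take part in the comparison.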
If you are trying to determine whether they are 100% equal, you can negate one and add it to the other (or simply XOR the two); if the result is zero, they are identical. Extending this with unsafe code, take 64 bits at a time as a long and do the math that way; any difference can cause an immediate failure.
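A minimal sketch of the 64-bits-at-a-time idea, assuming the pixel data has already been copied into two byte arrays and the project is compiled with unsafe code enabled. The names BufferCompare and EqualByLongs are hypothetical, and XOR is used as the "result is zero" test:

```csharp
using System;

public static class BufferCompare
{
    // Compares two buffers 8 bytes at a time; the XOR of two equal chunks is zero.
    public static unsafe bool EqualByLongs(byte[] a, byte[] b)
    {
        if (a == null || b == null) return ReferenceEquals(a, b);
        if (a.Length != b.Length) return false;

        fixed (byte* pa = a, pb = b)
        {
            int chunks = a.Length / sizeof(long);
            long* la = (long*)pa;
            long* lb = (long*)pb;
            for (int i = 0; i < chunks; i++)
            {
                if ((la[i] ^ lb[i]) != 0) return false;   // fail on the first difference
            }
            // Compare the tail bytes that do not fill a whole 64-bit chunk.
            for (int i = chunks * sizeof(long); i < a.Length; i++)
            {
                if (pa[i] != pb[i]) return false;
            }
        }
        return true;
    }
}
```

The same loop can run directly over locked bitmap memory instead of arrays, which avoids the copy.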
If the images are not 100% identical (comparing png to jpeg), or if you are not looking for a 100% match then you have some more work ahead of you.
Good luck.
Well, you're using .LockBits, so presumably you're using unsafe code. Rather than treating each row origin (Scan0 + y * Stride) as a byte*, consider treating it as an int*; int arithmetic is pretty quick, and you only have to do 1/4 as much work. And for images in ARGB you might still be talking in pixels, making the math simple.
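A sketch of the int* suggestion, assuming System.Drawing, unsafe code enabled, and 32bpp ARGB so that one int is exactly one pixel. The class name BitmapIntCompare is illustrative; CompareByInts is the label used in the question's benchmarks, though this is not the benchmarked code:

```csharp
using System.Drawing;
using System.Drawing.Imaging;

public static class BitmapIntCompare
{
    public static unsafe bool CompareByInts(Bitmap a, Bitmap b)
    {
        if (ReferenceEquals(a, b)) return true;    // same object, or both null
        if (a == null || b == null || a.Size != b.Size) return false;

        var rect = new Rectangle(Point.Empty, a.Size);
        BitmapData da = a.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
        try
        {
            BitmapData db = b.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
            try
            {
                for (int y = 0; y < da.Height; y++)
                {
                    // Row origin is Scan0 + y * Stride; read it as ints, one per pixel.
                    int* rowA = (int*)((byte*)da.Scan0.ToPointer() + (long)y * da.Stride);
                    int* rowB = (int*)((byte*)db.Scan0.ToPointer() + (long)y * db.Stride);
                    for (int x = 0; x < da.Width; x++)
                    {
                        if (rowA[x] != rowB[x]) return false;
                    }
                }
                return true;
            }
            finally { b.UnlockBits(db); }
        }
        finally { a.UnlockBits(da); }
    }
}
```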
Could you take a hash of each and compare? It would be slightly probabilistic, but practically not.
Thanks to Ram, here's a sample implementation of this technique.
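The linked implementation is not reproduced here. A hedged sketch of the hashing approach, following the question's description (Marshal.Copy into a byte array, then MD5.ComputeHash), could look like this. BitmapHashCompare and HashPixels are hypothetical names, and locking as 32bpp ARGB is an assumption:

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Linq;
using System.Runtime.InteropServices;
using System.Security.Cryptography;

public static class BitmapHashCompare
{
    // Hashes only the pixel bytes of each row, skipping any stride padding.
    public static byte[] HashPixels(Bitmap bmp)
    {
        var rect = new Rectangle(Point.Empty, bmp.Size);
        BitmapData data = bmp.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
        try
        {
            int rowBytes = data.Width * 4;
            var bytes = new byte[rowBytes * data.Height];
            for (int y = 0; y < data.Height; y++)
            {
                Marshal.Copy(IntPtr.Add(data.Scan0, y * data.Stride), bytes, y * rowBytes, rowBytes);
            }
            using (MD5 md5 = MD5.Create())
            {
                return md5.ComputeHash(bytes);
            }
        }
        finally { bmp.UnlockBits(data); }
    }

    public static bool CompareByMD5(Bitmap a, Bitmap b)
    {
        return a.Size == b.Size && HashPixels(a).SequenceEqual(HashPixels(b));
    }
}
```

For a one-off comparison this does more work than a direct byte comparison, which matches the benchmark numbers in the question; it pays off when the hash of an image can be cached and reused.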
If the original problem is just to find exact duplicates between two bitmaps, then a bit-level comparison will do. I don't know C# but in C I would use the following function:
I would start looking in the middle because I suspect there is a much better chance of finding unequal bits near the middle of the image than the beginning; of course, this would really depend on the images you are deduping, selecting a random place to start may be best.
If you are trying to find the exact duplicates among hundreds of images then comparing all pairs of them is unnecessary. First compute the MD5 hash of each image and place it in a list of pairs (md5Hash, imageId); then sort the list by md5Hash. Next, only do pairwise comparisons on the images that have the same md5Hash.
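The hash-then-compare idea could be sketched as follows, assuming each image's pixel data is already available as a byte array keyed by an image id. DuplicateFinder, FindExactDuplicates and the dictionary shape are all made up for illustration, and grouping by hash stands in for the sort described above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateFinder
{
    // images: hypothetical map from image id to that image's raw pixel bytes.
    public static List<(string, string)> FindExactDuplicates(IDictionary<string, byte[]> images)
    {
        var duplicates = new List<(string, string)>();
        using (MD5 md5 = MD5.Create())
        {
            // Bucket the images by hash; only buckets with 2+ entries can hold duplicates.
            var buckets = images
                .GroupBy(kv => Convert.ToBase64String(md5.ComputeHash(kv.Value)))
                .Where(g => g.Count() > 1);

            foreach (var bucket in buckets)
            {
                var items = bucket.ToList();
                for (int i = 0; i < items.Count; i++)
                {
                    for (int j = i + 1; j < items.Count; j++)
                    {
                        // Confirm with a full byte comparison in case of a hash collision.
                        if (items[i].Value.SequenceEqual(items[j].Value))
                        {
                            duplicates.Add((items[i].Key, items[j].Key));
                        }
                    }
                }
            }
        }
        return duplicates;
    }
}
```

Each image is hashed once, so the expensive pairwise comparison only runs inside buckets that already share a hash.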
If these bitmaps are already on your graphics card then you can parallelize such a check by doing it on the graphics card using a language like CUDA or OpenCL.
I'll explain in terms of CUDA, since that's the one I know. Basically CUDA lets you write general purpose code to run in parallel across each node of your graphics card. You can access bitmaps that are in shared memory. Each invocation of the function is also given an index within the set of parallel runs. So, for a problem like this, you'd just run one of the above comparison functions for some subset of the bitmap - using parallelization to cover the entire bitmap. Then, just write a 1 to a certain memory location if the comparison fails (and write nothing if it succeeds).
If you don't already have the bitmaps on your graphics card, this probably isn't the way to go, since the costs for loading the two bitmaps on your card will easily eclipse the savings such parallelization will gain you.
Here's some (pretty bad) example code (it's been a little while since I programmed CUDA). There are better ways to access bitmaps that are already loaded as textures, but I didn't bother here.
If you can implement something like Duff's Device in your language, that might give you a significant speed boost over a simple loop. Usually it's used for copying data, but there's no reason it can't be used for comparing data instead.
Or, for that matter, you may just want to use some equivalent to memcmp().
You could try to add them to a database "blob" then use the database engine to compare their binaries. This would only give you a yes or no answer to whether the binary data is the same. It would be very easy to make 2 images that produce the same graphic but have different binary though.
You could also select a few random pixels and compare them, then if they are the same continue with more until you've checked all the pixels. This would only return negative matches faster, though; confirming a 100% match would still take just as long.
Based on the approach of comparing hashes instead of comparing every single pixel, this is what I use:
Usage is straightforward: