How to efficiently identify a binary file

Posted on 2024-09-14 04:40:30

What's the most efficient way to identify a binary file? I would like to extract some kind of signature from a binary file and use it to compare the file with others.

The brute-force approach would be to use the whole file as its own signature, which would take too long and use too much memory. I'm looking for a smarter approach, and I'm willing to sacrifice a little accuracy (but not too much, eh) for performance.

(While Java code examples are preferred, language-agnostic answers are encouraged.)

Edit: Scanning the whole file to create a hash has the disadvantage that the bigger the file, the longer it takes. Since the hash wouldn't be unique anyway, I was wondering whether there is a more efficient approach (e.g. a hash computed from an evenly distributed sampling of bytes).
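The sampling idea from the edit could look roughly like the Java sketch below. It is only an illustration under stated assumptions: the class name `SampledSignature`, the `samples` and `windowSize` parameters, and the choice of SHA-256 are all hypothetical and not part of the question. The file length is mixed into the digest, and files small enough to be covered by the samples are hashed completely.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: hashes the file length plus evenly spaced windows of bytes. */
final class SampledSignature {
    private SampledSignature() {}

    static byte[] of(Path file, int samples, int windowSize)
            throws IOException, NoSuchAlgorithmException {
        if (samples < 1 || windowSize < 1
                || (long) samples * windowSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("invalid sampling parameters");
        }
        MessageDigest md = MessageDigest.getInstance("SHA-256"); // assumed algorithm
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            // Mix in the length so files of different sizes rarely share a signature.
            md.update(ByteBuffer.allocate(Long.BYTES).putLong(length).array());
            if (length <= (long) samples * windowSize) {
                // Small file: the samples would cover it anyway, so hash everything.
                byte[] all = new byte[(int) length];
                raf.readFully(all);
                md.update(all);
            } else {
                byte[] window = new byte[windowSize];
                long lastStart = length - windowSize;
                for (int i = 0; i < samples; i++) {
                    // Spread window offsets evenly from the start to the end of the file.
                    long offset = samples == 1 ? 0 : lastStart * i / (samples - 1);
                    raf.seek(offset);
                    raf.readFully(window);
                    md.update(window);
                }
            }
        }
        return md.digest();
    }
}
```

The cost is bounded by `samples * windowSize` bytes regardless of file size, but two files that differ only between the sampled windows get the same signature, so this trades accuracy for speed exactly as the question describes.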


Comments (4)

盗琴音 2024-09-21 04:40:30

An approach I found effective for this sort of thing was to calculate two SHA-1 hashes: one for the first block of the file (I arbitrarily picked 512 bytes as the block size) and one for the whole file. I then stored the two hashes along with the file size. When I needed to identify a file, I would first compare the file length. If the lengths matched, I would compare the hash of the first block, and only if that matched would I compare the hash of the entire file. The first two tests quickly weeded out a lot of non-matching files.
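A minimal Java sketch of that three-step check might look like this. The class `FileFingerprint` and its method names are hypothetical, chosen only for illustration; the 512-byte first block and SHA-1 come from the description above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical holder for the three stored values: size, first-block hash, full hash. */
final class FileFingerprint {
    static final int FIRST_BLOCK = 512; // block size picked arbitrarily, as in the answer

    final long size;
    final byte[] firstBlockHash;
    final byte[] fullHash;

    private FileFingerprint(long size, byte[] firstBlockHash, byte[] fullHash) {
        this.size = size;
        this.firstBlockHash = firstBlockHash;
        this.fullHash = fullHash;
    }

    static FileFingerprint of(Path file) throws IOException, NoSuchAlgorithmException {
        return new FileFingerprint(
                Files.size(file), sha1(file, FIRST_BLOCK), sha1(file, Long.MAX_VALUE));
    }

    /** SHA-1 over at most {@code limit} leading bytes of the file, read in chunks. */
    static byte[] sha1(Path file, long limit) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        long remaining = limit;
        try (InputStream in = Files.newInputStream(file)) {
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) break; // end of file
                md.update(buf, 0, n);
                remaining -= n;
            }
        }
        return md.digest();
    }

    /** Cheapest test first: size, then first-block hash, then whole-file hash. */
    boolean matches(Path candidate) throws IOException, NoSuchAlgorithmException {
        if (Files.size(candidate) != size) return false;
        if (!Arrays.equals(sha1(candidate, FIRST_BLOCK), firstBlockHash)) return false;
        return Arrays.equals(sha1(candidate, Long.MAX_VALUE), fullHash);
    }
}
```

The expensive whole-file hash of a candidate is only computed when both cheap checks pass. For simplicity this sketch reads the file twice when building a fingerprint; a single pass feeding two digests would avoid that.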

夏末 2024-09-21 04:40:30

That's what hashing is for. See MessageDigest.

Note that if your file is too big to be read into memory, that's OK, because you can feed the file to the hash function in chunks. MD5 and SHA-1, for example, process their input in 512-bit blocks internally, so the whole file never has to be held in memory at once.

Also, two files with the same hash aren't necessarily identical (although a mismatch is very rare), but two identical files necessarily have the same hash.
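As a small illustration of feeding chunks to `MessageDigest`, here is a hedged Java sketch. The class name `StreamingDigest` and the 64 KiB buffer size are assumptions made for the example; `MessageDigest.getInstance`, `update` and `digest` are the standard `java.security` calls.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that hashes a stream chunk by chunk. */
final class StreamingDigest {
    private StreamingDigest() {}

    /** Returns the lowercase hex digest of everything readable from {@code in}. */
    static String hex(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[64 * 1024]; // assumed chunk size; any size works
        int n;
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n); // only the bytes actually read are hashed
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Called with `Files.newInputStream(path)` and `"SHA-1"`, memory use stays at one buffer no matter how large the file is, although the running time still grows with the file size.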

半世晨晓 2024-09-21 04:40:30

The usual answer is to use MD5, but I'd suggest that collisions are too easy to produce for MD5 to be used in modern applications: http://www.mscs.dal.ca/~selinger/md5collision/

SHA-1 replaced MD5 over a decade ago.

In 2005 NIST recommended that SHA-2 be used in place of SHA-1 by 2010, because of work that had demonstrated collisions in reduced variants of SHA-1. (That was pretty good foresight, since it is now reported that finding a SHA-1 collision takes about 2^51 work, where ideally it should require 2^80.)

So please, based on what you're trying to do and which other programs you may need to interoperate with, choose among MD5 (please no), SHA-1 (I'd understand, but we can do better), and SHA-2 (pick me! pick me!).
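To make the 2^80 figure concrete: for an n-bit hash, a generic birthday search needs roughly 2^(n/2) work to find a collision. The sketch below, with the hypothetical class name `DigestSizes`, queries the digest length of each algorithm through the standard `MessageDigest.getDigestLength()` call and derives that ideal exponent. It assumes the default JDK provider, which reports the length in bytes.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper comparing digest sizes of the algorithms discussed above. */
final class DigestSizes {
    private DigestSizes() {}

    /** Digest length in bits, as reported by the provider. */
    static int bits(String algorithm) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(algorithm).getDigestLength() * 8;
    }

    /** log2 of the generic birthday-bound work for a collision: half the digest size. */
    static int idealCollisionWorkLog2(String algorithm) throws NoSuchAlgorithmException {
        return bits(algorithm) / 2;
    }
}
```

With these definitions SHA-1 has a 160-bit digest and therefore an ideal collision cost of 2^80, while SHA-256 (a member of the SHA-2 family) has an ideal cost of 2^128. Switching algorithms in Java is just a matter of passing a different name to `MessageDigest.getInstance`.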

飘逸的'云 2024-09-21 04:40:30

Have you considered using header identification?
If you can design your files that way, it would be fast and reliable.
With a single byte you can distinguish up to 256 file types ;)
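If you do control the file format, a one-byte header tag could be handled along these lines. This is only a sketch: the class `HeaderTag` and the convention that the tag is the very first byte of the file are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for a format whose first byte identifies the file type. */
final class HeaderTag {
    private HeaderTag() {}

    /** Writes the tag byte followed by the payload. */
    static void write(Path file, int tag, byte[] payload) throws IOException {
        if (tag < 0 || tag > 255) {
            throw new IllegalArgumentException("tag must fit in one byte");
        }
        byte[] out = new byte[payload.length + 1];
        out[0] = (byte) tag;
        System.arraycopy(payload, 0, out, 1, payload.length);
        Files.write(file, out);
    }

    /** Returns the tag as a value in 0..255, or -1 if the file is empty. */
    static int read(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return in.read(); // reads exactly one byte, regardless of file size
        }
    }
}
```

Reading the tag costs a single byte of I/O, but it only tells file types apart; two different files of the same type still need one of the hashing approaches from the other answers.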
