Faster MD5 alternative?
I'm working on a program that searches entire drives for a given file. At the moment, I calculate an MD5 hash for the known file and then scan all files recursively, looking for a match.
The only problem is that MD5 is painfully slow on large files. Is there a faster alternative that I can use while retaining a very small probability of false positives?
All code is in C#.
Thank you.
Update
I've read that even MD5 can be pretty quick and that disk I/O should be the limiting factor. That leads me to believe that my code might not be optimal. Are there any problems with this approach?
MD5 md5 = MD5.Create();
StringBuilder sb = new StringBuilder();
try
{
    using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read))
    {
        foreach (byte b in md5.ComputeHash(fs))
            sb.Append(b.ToString("X2"));
    }
    return sb.ToString();
}
catch (Exception)
{
    return "";
}
6 Answers
I hope you're checking for an MD5 match only if the file size already matches.
Another optimization is to do a quick checksum of the first 1K (or some other arbitrary, but reasonably small number) and make sure those match before working the whole file.
Of course, all this assumes that you're just looking for a match/nomatch decision for a particular file.
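A minimal sketch of those two cheap pre-checks, assuming the known file's size and first kilobyte have been captured up front (the class and method names here are illustrative, not from the post):

```csharp
using System.IO;
using System.Linq;

static class PreChecks
{
    // Cheapest test first: the lengths must match before anything else.
    public static bool SizeMatches(string candidate, long knownSize)
        => new FileInfo(candidate).Length == knownSize;

    // Compare only the first 1 KB (or whatever knownPrefix holds)
    // before committing to hashing the whole candidate.
    public static bool PrefixMatches(string candidate, byte[] knownPrefix)
    {
        byte[] buffer = new byte[knownPrefix.Length];
        using (FileStream fs = File.OpenRead(candidate))
        {
            int read = 0, n;
            while (read < buffer.Length &&
                   (n = fs.Read(buffer, read, buffer.Length - read)) > 0)
                read += n;
            if (read != knownPrefix.Length)
                return false; // candidate is shorter than the prefix
        }
        return buffer.SequenceEqual(knownPrefix);
    }
}
```

Both checks together reject most candidates after reading at most 1 KB, so the full-file hash only runs on files that are already plausible matches.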
Regardless of cryptographic requirements, the possibility of a hash collision exists, so no hashing function can be used to guarantee that two files are identical.
I wrote similar code a while back, which I got running pretty fast by indexing all the files first and discarding any with a different size. A fast hash comparison (on part of each file) was then performed on the remaining entries (comparing raw bytes at this step proved less useful - many file types have common headers with identical bytes at the start of the file). Any files left after this stage were then checked with MD5, and finally, if the MD5 matched, a byte comparison of the whole file, just to ensure the contents were the same.
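The staged filtering described above might be sketched like this for a single known/candidate pair (the 4 KB partial-hash size and the helper names are assumptions, not the answerer's actual code; the final whole-file comparison here loads both files into memory, which a production version would replace with a streaming compare):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class StagedMatch
{
    // Stages, cheapest first: size, partial hash (first 4 KB),
    // full MD5, then a byte-for-byte comparison as the final check.
    public static bool Matches(string known, string candidate)
    {
        if (new FileInfo(known).Length != new FileInfo(candidate).Length)
            return false;
        if (!PartialHash(known).SequenceEqual(PartialHash(candidate)))
            return false;
        if (!FullMd5(known).SequenceEqual(FullMd5(candidate)))
            return false;
        return File.ReadAllBytes(known).SequenceEqual(File.ReadAllBytes(candidate));
    }

    // Hash only the first 4 KB of the file.
    static byte[] PartialHash(string path)
    {
        byte[] head = new byte[4096];
        using (var fs = File.OpenRead(path))
        {
            int read = fs.Read(head, 0, head.Length);
            Array.Resize(ref head, read);
        }
        using (var md5 = MD5.Create())
            return md5.ComputeHash(head);
    }

    static byte[] FullMd5(string path)
    {
        using (var md5 = MD5.Create())
        using (var fs = File.OpenRead(path))
            return md5.ComputeHash(fs);
    }
}
```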
Just read the file linearly? It seems pretty pointless to read the entire file, compute an MD5 hash, and then compare the hashes.
Reading the file sequentially, a few bytes at a time, would allow you to discard the vast majority of files after reading, say, 4 bytes. And you'd save all the processing overhead of computing a hashing function which doesn't give you anything in your case.
If you already had the hashes for all the files in the drive, it'd make sense to compare them, but if you have to compute them on the fly, there just doesn't seem to be any advantage to the hashing.
Am I missing something here? What does hashing buy you in this case?
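A streaming comparison along the lines this answer suggests could look like the sketch below (buffer size is an arbitrary choice); it bails out at the first differing byte, so most non-matches are rejected within the first buffer:

```csharp
using System.IO;

static class ByteCompare
{
    // Compare two files chunk by chunk, stopping at the first difference.
    public static bool FilesEqual(string pathA, string pathB)
    {
        const int BufSize = 64 * 1024;
        byte[] a = new byte[BufSize], b = new byte[BufSize];
        using (var fa = File.OpenRead(pathA))
        using (var fb = File.OpenRead(pathB))
        {
            if (fa.Length != fb.Length) return false;
            while (true)
            {
                int ra = Fill(fa, a);
                int rb = Fill(fb, b);
                if (ra != rb) return false;
                if (ra == 0) return true; // both streams exhausted
                for (int i = 0; i < ra; i++)
                    if (a[i] != b[i]) return false;
            }
        }
    }

    // Read until the buffer is full or the stream ends.
    static int Fill(Stream s, byte[] buf)
    {
        int total = 0, n;
        while (total < buf.Length && (n = s.Read(buf, total, buf.Length - total)) > 0)
            total += n;
        return total;
    }
}
```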
First consider what your bottleneck really is: the hash function itself, or the disk access speed? If you are disk-bound, changing the hashing algorithm won't gain you much. From your description I infer that you are always scanning the whole disk to find a match - consider building an index first and then matching a given hash against the index only; that will be much faster.
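One possible shape for such an index, sketched under the assumption that MD5 remains the hash (any hash would do), is a one-time scan into a dictionary so that later lookups cost O(1) instead of a full rescan:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class DriveIndex
{
    // Build a hash -> paths index once; look up known hashes afterwards.
    public static Dictionary<string, List<string>> Build(string root)
    {
        var index = new Dictionary<string, List<string>>();
        using (var md5 = MD5.Create())
        {
            foreach (string path in Directory.EnumerateFiles(
                         root, "*", SearchOption.AllDirectories))
            {
                string hash;
                using (var fs = File.OpenRead(path))
                    hash = BitConverter.ToString(md5.ComputeHash(fs));
                if (!index.TryGetValue(hash, out var list))
                    index[hash] = list = new List<string>();
                list.Add(path);
            }
        }
        return index;
    }
}
```

The index goes stale as files change, so a real implementation would also need to refresh entries, e.g. by re-hashing files whose timestamp or size has changed since the last scan.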
There is one small problem with using MD5 to compare files: there are known pairs of files which are different but have the same MD5.
This means you can use MD5 to tell if the files are different (if the MD5 is different, the files must be different), but you cannot use MD5 to tell if the files are equal (if the files are equal, the MD5 must be the same, but if the MD5 is equal, the files might or might not be equal).
You should either use a hash function which has not been broken yet (like SHA-1), or (as @SoapBox mentioned) use MD5 only as a fast way to find candidates for a deeper comparison.
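Swapping the hash is a small change with the framework's hash classes; a sketch mirroring the shape of the code in the question (method name is illustrative):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class Hashing
{
    // Same structure as the MD5 version in the question, but with SHA-1.
    public static string Sha1Of(string fileName)
    {
        using (var sha1 = SHA1.Create())
        using (var fs = File.OpenRead(fileName))
            return BitConverter.ToString(sha1.ComputeHash(fs)).Replace("-", "");
    }
}
```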
Use MD5CryptoServiceProvider and BufferedStream
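Applied to the code in the question, that suggestion might look like the following (the 1 MB buffer size is an arbitrary choice; note that MD5CryptoServiceProvider is marked obsolete on newer .NET versions, where MD5.Create() is the usual factory):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class BufferedHash
{
    // Wrap the FileStream in a BufferedStream so the hash reads
    // large chunks from disk instead of many small ones.
    public static string Md5Of(string fileName)
    {
        using (var md5 = new MD5CryptoServiceProvider())
        using (var fs = File.OpenRead(fileName))
        using (var bs = new BufferedStream(fs, 1 << 20)) // 1 MB buffer
            return BitConverter.ToString(md5.ComputeHash(bs)).Replace("-", "");
    }
}
```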