How to quickly compare 2 files using .NET?
Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.
- Would a checksum comparison such as CRC be faster?
- Are there any .NET libraries that can generate a checksum for a file?
Comments (21)
The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you would use an array of bytes sized to Int64, and then compare the resulting numbers.
Here's what I came up with:
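The code block itself was lost in extraction; this is a reconstruction of the described approach (8-byte reads compared as a single `Int64`), not necessarily the author's verbatim code:

```csharp
using System;
using System.IO;

static class FileCompare
{
    const int BYTES_TO_READ = sizeof(Int64);

    public static bool FilesAreEqual(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            byte[] one = new byte[BYTES_TO_READ];
            byte[] two = new byte[BYTES_TO_READ];

            for (int i = 0; i < iterations; i++)
            {
                // NB: the return value of Read is ignored here; a later
                // answer on this page points that out as a bug worth fixing.
                fs1.Read(one, 0, BYTES_TO_READ);
                fs2.Read(two, 0, BYTES_TO_READ);

                if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                    return false;
            }
        }
        return true;
    }
}
```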
In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.
Here are the ReadByte and hashing methods I used, for comparison purposes:
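These blocks were also lost; below are standard reconstructions of the two baselines described (the choice of SHA1 for the hash variant is my assumption, since the answer doesn't name the algorithm):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileCompareBaselines
{
    // straightforward byte-by-byte comparison
    public static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            for (long i = 0; i < first.Length; i++)
                if (fs1.ReadByte() != fs2.ReadByte())
                    return false;
        }
        return true;
    }

    // hash both files and compare the digests
    public static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
    {
        using (var hasher = SHA1.Create())
        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            byte[] firstHash = hasher.ComputeHash(fs1);
            byte[] secondHash = hasher.ComputeHash(fs2);

            for (int i = 0; i < firstHash.Length; i++)
                if (firstHash[i] != secondHash[i])
                    return false;
            return true;
        }
    }
}
```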
A checksum comparison will most likely be slower than a byte-by-byte comparison.
In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.
As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
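The original snippet was lost; this is a conventional reconstruction of a short MD5-checksum example of the kind the answer refers to:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// compute the MD5 digest of a file's contents
static byte[] ComputeMd5(string path)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
        return md5.ComputeHash(stream);
}

// pre-compute the checksum of the existing file once, so that only the new
// file needs disk I/O on each later check (the point made just below):
// byte[] knownChecksum = ComputeMd5("existing.bin");
// bool same = ComputeMd5("new.bin").SequenceEqual(knownChecksum);
```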
However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean only needing to do the DiskIO one time, on the new file. This would likely be faster than a byte-by-byte comparison.
If you *do* decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:
• for `System.String` path names:
• for `System.IO.FileInfo` instances:
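The two code blocks the bullets refer to were lost; the approach described (full load plus `SequenceEqual`) looks like this, reconstructed:

```csharp
using System.IO;
using System.Linq;

// for System.String path names:
static bool FileEquals(string path1, string path2) =>
    File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));

// for System.IO.FileInfo instances (with a cheap length short-circuit):
static bool AreEqual(FileInfo fi1, FileInfo fi2) =>
    fi1.Length == fi2.Length &&
    (fi1.Length == 0 ||
     File.ReadAllBytes(fi1.FullName).SequenceEqual(File.ReadAllBytes(fi2.FullName)));
```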
Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc. But as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line-endings, character encoding, media metadata, whitespace, padding, source-code comments, etc.¹) will always be considered not-equal.

This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (which is fundamentally optimized to keep small, short-lived allocations extremely cheap), and it could in fact even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) means maximally delegating file-performance issues to the `CLR`, `BCL`, and `JIT`, benefiting from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.

Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via `LINQ` enumerators (as shown here) are moot, since hitting the disk *at all* for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though `SequenceEqual` does in fact give us the "optimization" of abandoning on the first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary for any true-positive case.

¹ An obscure exception: NTFS alternate data streams are not examined by any of the answers discussed on this page, so such streams may be different for files otherwise reported as the "same."
In addition to Reed Copsey's answer:
The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.
If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.
For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
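As a sketch, the length pre-check needs only file metadata, never the contents:

```csharp
using System.IO;

// hypothetical helper: false means "cannot possibly be identical"
static bool LengthsMatch(string path1, string path2) =>
    new FileInfo(path1).Length == new FileInfo(path2).Length;
```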
It gets even faster if you don't read in small 8-byte chunks, but instead put a loop around it and read a larger chunk. I reduced the average comparison time to a quarter.
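A sketch of that variant; the chunk size is my guess, since the answer doesn't say what was used:

```csharp
using System;
using System.IO;

static bool FilesAreEqual_Chunked(FileInfo first, FileInfo second)
{
    if (first.Length != second.Length)
        return false;

    const int chunkSize = sizeof(long) * 1024;   // 8 KB per read instead of 8 bytes

    using (FileStream fs1 = first.OpenRead())
    using (FileStream fs2 = second.OpenRead())
    {
        byte[] one = new byte[chunkSize];
        byte[] two = new byte[chunkSize];

        while (true)
        {
            int len1 = fs1.Read(one, 0, chunkSize);
            int len2 = fs2.Read(two, 0, chunkSize);
            if (len1 != len2) return false;   // simplification: assumes Read
                                              // fills the buffer except at EOF
            if (len1 == 0) return true;

            // bytes past len1 are leftovers from the previous (equal) chunk
            // in both buffers, so comparing whole 8-byte words stays safe
            for (int i = 0; i < len1; i += sizeof(long))
                if (BitConverter.ToInt64(one, i) != BitConverter.ToInt64(two, i))
                    return false;
        }
    }
}
```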
Edit: This method would not work for comparing binary files!
In .NET 4.0, the `File` class has the following two new methods, which means you could use:
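The methods in question are the `File.ReadLines(String)` and `File.ReadLines(String, Encoding)` overloads, new in .NET 4.0; they enumerate a file lazily as lines, which is also why this fails for binary files. The stripped usage was presumably:

```csharp
using System.IO;
using System.Linq;

bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));
```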
The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may however very well be eaten up by the added time of calculating the hash.
Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.
You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.
If the hash code for example is 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.
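For reference, the arithmetic behind that figure (assuming a uniformly distributed 32-bit hash and two fixed files):

```latex
P(\text{false match}) = 2^{-32} \approx 2.3 \times 10^{-10},
\qquad 1 - 2^{-32} \approx 99.99999998\,\%
```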
My answer is a derivative of @lars's, but it fixes the bug in the call to `Stream.Read`. I also add some fast-path checks that other answers had, plus input validation. In short, this should be the answer:
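The fixed code itself was lost; below is a sketch of what the description implies: argument validation, the fast-path checks, and a read loop that keeps calling `Stream.Read` until each buffer is actually full (ignoring `Read`'s return value is the bug in question, since `Read` may return fewer bytes than requested):

```csharp
using System;
using System.IO;

public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
    if (fileInfo1 == null) throw new ArgumentNullException(nameof(fileInfo1));
    if (fileInfo2 == null) throw new ArgumentNullException(nameof(fileInfo2));

    if (string.Equals(fileInfo1.FullName, fileInfo2.FullName,
                      StringComparison.OrdinalIgnoreCase))
        return true;   // same path: trivially equal
    if (fileInfo1.Length != fileInfo2.Length)
        return false;  // different lengths: trivially not equal

    using (var stream1 = fileInfo1.OpenRead())
    using (var stream2 = fileInfo2.OpenRead())
        return StreamsContentsAreEqual(stream1, stream2);
}

private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
    const int bufferSize = 1024 * sizeof(long);
    var buffer1 = new byte[bufferSize];
    var buffer2 = new byte[bufferSize];

    while (true)
    {
        // the bug fix: loop until each buffer is actually full (or EOF)
        int count1 = ReadFullBuffer(stream1, buffer1);
        int count2 = ReadFullBuffer(stream2, buffer2);

        if (count1 != count2) return false;
        if (count1 == 0) return true;

        for (int i = 0; i < count1; i += sizeof(long))
            if (BitConverter.ToInt64(buffer1, i) != BitConverter.ToInt64(buffer2, i))
                return false;
    }
}

private static int ReadFullBuffer(Stream stream, byte[] buffer)
{
    int bytesRead = 0;
    while (bytesRead < buffer.Length)
    {
        int read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
        if (read == 0) return bytesRead;   // end of stream
        bytesRead += read;
    }
    return bytesRead;
}
```

Or if you want to be super-awesome, you can use the async variant: the same shape with `ReadAsync`/`await` in place of the synchronous reads (that block was likewise lost).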
Honestly, I think you need to prune your search tree down as much as possible.
Things to check before going byte-by-byte:
• Do the two files have the same size? If not, they cannot be equal.
• Do both paths actually point to the same file? If so, they are trivially equal.
Also, reading large blocks at a time will be more efficient since drives read sequential bytes more quickly. Going byte-by-byte causes not only far more system calls, but it causes the read head of a traditional hard drive to seek back and forth more often if both files are on the same drive.
Read chunk A and chunk B into a byte buffer, and compare them (do NOT use Array.Equals, see comments). Tune the size of the blocks until you hit what you feel is a good trade off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.
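A sketch of that chunk loop under the stated constraints; `Span.SequenceEqual` (vectorized in modern .NET) stands in for the hand-written comparison, and the default block size is just a starting point for tuning:

```csharp
using System;
using System.IO;

static bool ChunkedEquals(string path1, string path2, int blockSize = 1 << 20)
{
    using (FileStream a = File.OpenRead(path1))
    using (FileStream b = File.OpenRead(path2))
    {
        if (a.Length != b.Length)
            return false;

        byte[] bufA = new byte[blockSize];
        byte[] bufB = new byte[blockSize];

        while (true)
        {
            int nA = a.Read(bufA, 0, blockSize);
            int nB = b.Read(bufB, 0, blockSize);
            if (nA != nB) return false;   // simplification: assumes full reads
            if (nA == 0) return true;     // both files exhausted, all equal

            // note: Array.Equals would compare references, not contents,
            // which is exactly the trap the answer warns about
            if (!bufA.AsSpan(0, nA).SequenceEqual(bufB.AsSpan(0, nB)))
                return false;
        }
    }
}
```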
Inspired by https://dev.to/emrahsungu/how-to-compare-two-files-using-net-really-really-fast-2pd9
Here is a proposal to do it with AVX2 SIMD instructions:
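The article's code isn't reproduced in this answer as captured, so here is a minimal sketch of the core idea, comparing two equally-sized byte buffers 32 bytes at a time with AVX2 intrinsics (requires AllowUnsafeBlocks and an AVX2-capable CPU):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe bool BuffersEqualAvx2(byte[] a, byte[] b)
{
    if (a.Length != b.Length) return false;
    if (!Avx2.IsSupported) throw new PlatformNotSupportedException();

    fixed (byte* pa = a, pb = b)
    {
        int i = 0;
        int lastBlock = a.Length - Vector256<byte>.Count;
        for (; i <= lastBlock; i += Vector256<byte>.Count)
        {
            Vector256<byte> va = Avx.LoadVector256(pa + i);
            Vector256<byte> vb = Avx.LoadVector256(pb + i);
            // CompareEqual yields 0xFF per equal byte lane; MoveMask packs the
            // lane MSBs into an int, so "all 32 bytes equal" is all bits set (-1)
            if (Avx2.MoveMask(Avx2.CompareEqual(va, vb)) != -1)
                return false;
        }
        for (; i < a.Length; i++)   // scalar tail for the remainder
            if (pa[i] != pb[i]) return false;
    }
    return true;
}
```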
If the files are not too big, you can use:
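The snippet was lost; presumably it was an in-memory one-liner along these lines (my reconstruction, with `file1`/`file2` as hypothetical path variables):

```csharp
using System.IO;
using System.Linq;

bool same = File.ReadAllBytes(file1).SequenceEqual(File.ReadAllBytes(file2));
```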
Comparing hashes only pays off if the hashes are worth storing, i.e. computed once and reused.
(Edited the code to something much cleaner.)
My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference compared with comparing the bytes of a byte array directly.
So it is possible to replace that "Math.Ceiling and iterations" loop in the comment above with the simplest one:
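That is, using the buffer names from the chunked sketch further up (`one` and `two`, holding `len1` valid bytes per read; names assumed), the inner comparison becomes a plain byte loop:

```csharp
// simplest possible per-chunk comparison
for (int i = 0; i < len1; i++)
    if (one[i] != two[i])
        return false;
```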
I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check arguments and then perform the bit shifting) before you compare, and that ends up being the same amount of work as comparing 8 bytes in two arrays.
Another improvement for large files with identical length might be to not read the files sequentially, but rather to compare more or less random blocks.
You can use multiple threads, starting on different positions in the file and comparing either forward or backwards.
This way you can detect changes at the middle/end of the file, faster than you would get there using a sequential approach.
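One way to sketch this (my choice of mechanism, not the answer's): memory-map both files and probe blocks at many positions in parallel, stopping at the first difference:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

static bool FilesEqualParallelBlocks(string path1, string path2, int blockSize = 64 * 1024)
{
    long length = new FileInfo(path1).Length;
    if (length != new FileInfo(path2).Length) return false;
    if (length == 0) return true;

    long blockCount = (length + blockSize - 1) / blockSize;
    bool equal = true;

    using (var map1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var map2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    {
        Parallel.For(0L, blockCount, (i, state) =>
        {
            long offset = i * blockSize;
            int size = (int)Math.Min(blockSize, length - offset);
            var b1 = new byte[size];
            var b2 = new byte[size];
            using (var v1 = map1.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
            using (var v2 = map2.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
            {
                v1.Read(b1, 0, size);   // simplification: assumes full reads
                v2.Read(b2, 0, size);
            }
            if (!b1.AsSpan().SequenceEqual(b2)) { equal = false; state.Stop(); }
        });
    }
    return equal;
}
```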
If you only need to compare two files, I guess the fastest way would be (in C, I don't know if it's applicable to .NET)
OTOH, if you need to find if there are duplicate files in a set of N files, then the fastest way is undoubtedly using a hash to avoid N-way bit-by-bit comparisons.
I think there are applications where "hash" is faster than comparing byte by byte, for example when you need to compare one file against many others, or against a thumbnail of a photo that can change. It depends on where and how it is used.
Here is how you can get the fastest result:
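The code was lost; a hedged guess at its shape: hash the bytes once (e.g. of a thumbnail) so the result can be stored and compared cheaply later:

```csharp
using System;
using System.Security.Cryptography;

// hypothetical helper: returns a storable fingerprint of the data
static string HashOf(byte[] data)
{
    using (var md5 = MD5.Create())
        return Convert.ToBase64String(md5.ComputeHash(data));
}
```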
Optionally, we can save the hash in a database.
Hope this can help
Here are some utility functions that allow you to determine if two files (or two streams) contain identical data.
I have provided a "fast" version that is multi-threaded: it compares the byte arrays (each buffer filled from what's been read from each file) in different threads using Tasks.
As expected, it's much faster (around 3x) but it consumes more CPU (because it's multi-threaded) and more memory (because each comparison thread needs two byte-array buffers).
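The utility functions themselves were lost; this is a sketch of the idea described, double-buffering the reads and comparing the previous pair of buffers on a Task while the next pair is read from disk:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static bool StreamsEqualPipelined(Stream s1, Stream s2, int bufferSize = 1 << 20)
{
    byte[][] bufA = { new byte[bufferSize], new byte[bufferSize] };
    byte[][] bufB = { new byte[bufferSize], new byte[bufferSize] };
    Task<bool> pending = Task.FromResult(true);
    int slot = 0;

    int ReadFull(Stream s, byte[] buf)
    {
        int total = 0, n;
        while (total < buf.Length && (n = s.Read(buf, total, buf.Length - total)) > 0)
            total += n;
        return total;
    }

    while (true)
    {
        int nA = ReadFull(s1, bufA[slot]);
        int nB = ReadFull(s2, bufB[slot]);

        if (!pending.Result) return false;   // previous chunk pair differed
        if (nA != nB) return false;
        if (nA == 0) return true;            // both streams exhausted, all equal

        byte[] a = bufA[slot], b = bufB[slot];
        int n = nA;
        pending = Task.Run(() => a.AsSpan(0, n).SequenceEqual(b.AsSpan(0, n)));
        slot ^= 1;                           // read the next chunks into the other pair
    }
}
```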
This I have found works well: compare first the lengths, without reading any data, and only then compare the byte sequences that are read:
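The code was lost; a minimal sketch of what is described, noting that the length check touches only file metadata:

```csharp
using System.IO;
using System.Linq;

static bool FilesEqual(string path1, string path2) =>
    new FileInfo(path1).Length == new FileInfo(path2).Length &&
    File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
```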
Yet another answer, derived from @chsh's: MD5 with `using`s, plus shortcut exits for the same file, a file that does not exist, and differing lengths:
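A reconstruction of that shape (the original block was lost): short-circuit on missing files, identical paths, and differing lengths before hashing anything:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static bool FilesAreEqual_Hash(string first, string second)
{
    if (!File.Exists(first) || !File.Exists(second))
        return false;
    if (string.Equals(first, second, StringComparison.OrdinalIgnoreCase))
        return true;
    if (new FileInfo(first).Length != new FileInfo(second).Length)
        return false;

    using (var md5 = MD5.Create())
    using (var s1 = File.OpenRead(first))
    using (var s2 = File.OpenRead(second))
        return md5.ComputeHash(s1).SequenceEqual(md5.ComputeHash(s2));
}
```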
Not really an answer, but kinda funny.
This is what GitHub's Copilot (AI) suggested :-)
I find the usage of `SequenceEqual` particularly interesting.
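The suggested snippet itself was lost; presumably it was a one-liner along these lines (my reconstruction, not the verbatim Copilot output):

```csharp
using System.IO;
using System.Linq;

bool areEqual = File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
```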
Something (hopefully) reasonably efficient:
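The code was lost; pure guesswork at its shape, here using `RandomAccess` (.NET 6+) as one reasonably efficient option:

```csharp
using System;
using System.IO;

static bool FilesEqual(string path1, string path2)
{
    using var h1 = File.OpenHandle(path1);
    using var h2 = File.OpenHandle(path2);

    long len = RandomAccess.GetLength(h1);
    if (len != RandomAccess.GetLength(h2))
        return false;

    var b1 = new byte[1 << 16];
    var b2 = new byte[1 << 16];

    for (long pos = 0; pos < len; )
    {
        int n1 = RandomAccess.Read(h1, b1, pos);           // simplification:
        int n2 = RandomAccess.Read(h2, b2.AsSpan(0, n1), pos); // assumes full reads
        if (n1 != n2 || !b1.AsSpan(0, n1).SequenceEqual(b2.AsSpan(0, n2)))
            return false;
        pos += n1;
    }
    return true;
}
```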
I liked the SequenceEqual answers above, but the hash comparison answers looked very messy. I prefer a hash comparison more like this:
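The preferred comparison itself was lost; one guess at a "cleaner" shape (SHA256 is my choice here, not necessarily the author's):

```csharp
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static bool HashesMatch(string file1, string file2)
{
    using (var sha = SHA256.Create())
    using (var s1 = File.OpenRead(file1))
    using (var s2 = File.OpenRead(file2))
        return sha.ComputeHash(s1).SequenceEqual(sha.ComputeHash(s2));
}
```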