Comparing large text files - is comparing hashes faster than using subsets of the files?

Posted on 2024-12-08 05:32:58

Say I have two large (text) files which are allegedly identical, but I want to make sure. The entire Harry Potter series of 'adult' and 'child' editions perhaps...

If the full text's string representation is too large to be held in memory at once, is it going to be faster to:

  • a) Hash both files in their entirety and then test to see if the hashes are identical

or

  • b) Read in manageable chunks of each file and compare them until you either reach EOF or find a mismatch

In other words, would the convenience of comparing 2 small hashes be offset by the time it took to generate said hashes?

I'm expecting a couple of "it depends" answers, so if you want some assumptions to work with:

  • Language is C# in .NET
  • Text files are 3GB each
  • Hash function is MD5
  • Maximum 'spare' RAM is 1GB
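
For concreteness, option (a) under these assumptions might look roughly like the sketch below. This is only an illustration (the class and method names are invented); it relies on the fact that ComputeHash(Stream) processes the stream in small internal buffers, so neither 3GB file ever has to fit in memory.

```csharp
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class HashCompare
{
    // Option (a): hash both files in their entirety and compare the digests.
    // ComputeHash(Stream) reads the stream in small internal buffers,
    // so neither 3GB file ever has to fit in memory at once.
    public static bool FilesHaveSameMd5(string pathA, string pathB)
    {
        using var md5 = MD5.Create();

        byte[] hashA, hashB;
        using (var a = File.OpenRead(pathA)) hashA = md5.ComputeHash(a);
        using (var b = File.OpenRead(pathB)) hashB = md5.ComputeHash(b);

        // Two 16-byte MD5 digests; comparing them is trivially cheap.
        return hashA.SequenceEqual(hashB);
    }
}
```

The obvious downside, as the answers below point out, is that both files are always read to the very end, even if they differ in the first kilobyte.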


Comments (3)

-柠檬树下少年和吉他 2024-12-15 05:32:58

  1. The MD5 checksum will be slower, since you need to process both files in full to get the result. You say you have 3GB files and only 1GB of spare memory; you do the math.

  2. Checking them in byte chunks will actually catch any difference earlier, and you can also check the file size, file length, etc. first...

I would go with option 2.
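
As a rough illustration of that suggestion, option (b) could be sketched as below; the 1 MB buffer size and the names are arbitrary choices, not something taken from the answer.

```csharp
using System;
using System.IO;

static class ChunkCompare
{
    // Option (b): cheap size check first, then compare fixed-size chunks and
    // stop at the first mismatch. 1 MB per buffer stays far under the 1GB budget.
    private const int BufferSize = 1 << 20;

    public static bool FilesAreEqual(string pathA, string pathB)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false; // different sizes: no need to read anything

        using var a = File.OpenRead(pathA);
        using var b = File.OpenRead(pathB);

        var bufferA = new byte[BufferSize];
        var bufferB = new byte[BufferSize];

        while (true)
        {
            int readA = Fill(a, bufferA);
            int readB = Fill(b, bufferB);

            if (readA != readB)
                return false;   // cannot happen after the size check, but be safe
            if (readA == 0)
                return true;    // both streams at EOF with no mismatch found

            if (!bufferA.AsSpan(0, readA).SequenceEqual(bufferB.AsSpan(0, readB)))
                return false;   // early exit on the first differing chunk
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    private static int Fill(Stream stream, byte[] buffer)
    {
        int total = 0;
        int read;
        while (total < buffer.Length &&
               (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }
        return total;
    }
}
```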

听风念你 2024-12-15 05:32:58

Assuming you have no future use for the hash information (to compare against other texts, or to check after potential future changes), then there are two cases:
A) documents are same
B) documents are different

If A, then there's almost no difference between the two scenarios. Both involve reading the entire files one chunk at a time and doing a calculation/compare on every byte. The computational overhead of the hash is minimal compared to the work of reading the files.

If B, then it's possible you'd find a difference in the first page of the files, at which point you'd be able to quit the process.

So depending on the relative probability of A v B, it seems comparing would be faster on average. Note also that you could then report where the change occurs, which you could not in the hash scenario.

森罗 2024-12-15 05:32:58

Option A is only useful if you reuse the hash (i.e. have other files to compare) so that the cost of calculating the hash isn't a factor...

Otherwise Option B is what I would go for...

To get maximum speed I would use MemoryMappedFile instances and XOR the content - the comparison can stop at the first difference encountered (i.e. where the XOR operation returns something != 0). Regarding memory consumption, you can use a "moving window" (i.e. via calls to CreateViewAccessor), which would allow processing files literally TB in size...

It could even be worth testing the performance of XOR against some LINQ-based comparison methods... and always start by comparing the file sizes; that way you avoid doing unnecessary calculations...
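
A minimal sketch of what that memory-mapped "moving window" idea might look like; the 64 MB window, the word-at-a-time XOR loop, and the names are illustrative assumptions, not code from the answer.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedCompare
{
    // Map both files read-only and walk them window by window via
    // CreateViewAccessor, XOR-ing 64-bit words so the scan can stop at the
    // first non-zero result. 64 MB per window is an arbitrary choice.
    private const long WindowSize = 64L * 1024 * 1024;

    public static bool FilesAreEqual(string pathA, string pathB)
    {
        long length = new FileInfo(pathA).Length;
        if (length != new FileInfo(pathB).Length)
            return false; // compare sizes first, as suggested above

        using var mmfA = MemoryMappedFile.CreateFromFile(
            pathA, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
        using var mmfB = MemoryMappedFile.CreateFromFile(
            pathB, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);

        for (long offset = 0; offset < length; offset += WindowSize)
        {
            long size = Math.Min(WindowSize, length - offset);
            using var viewA = mmfA.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read);
            using var viewB = mmfB.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read);

            long words = size / sizeof(long);
            for (long i = 0; i < words; i++)
            {
                // XOR of two equal words is zero; anything else is a mismatch.
                if ((viewA.ReadInt64(i * sizeof(long)) ^ viewB.ReadInt64(i * sizeof(long))) != 0)
                    return false;
            }

            // Compare any tail bytes that don't fill a whole Int64.
            for (long i = words * sizeof(long); i < size; i++)
            {
                if (viewA.ReadByte(i) != viewB.ReadByte(i))
                    return false;
            }
        }

        return true;
    }
}
```

Reading one Int64 at a time through the accessor is not the fastest possible inner loop (an unsafe pointer or a span over the view would beat it), but it keeps the sketch in safe code and shows the windowing idea.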
