Other than checking byte by byte, what other ways are there to determine that two files have the same contents?

Posted 2024-08-02 02:05:01

Comparing byte by byte surely works. But I am wondering whether there are any other proven ways, say some kind of hashing that outputs a unique value for each file. And if there are, what are the advantages and disadvantages of each in terms of time and memory footprint?

By the way, I found this previous thread: What is the fastest way to check if files are identical?. However, my question is not about speed, but about alternatives.

Please advise. Thanks.

Comments (4)

拔了角的鹿 2024-08-09 02:05:01

The only proven way is to do a byte-by-byte compare. It's also the fastest way and you can cut the memory usage all the way down to 2 bytes if you read a byte at a time. Reading larger chunks at a time is beneficial for performance though.
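The chunked byte-by-byte comparison this answer describes can be sketched in Python as follows; the function name and chunk size are illustrative, not from the thread. A length check is added first as a cheap early exit:

```python
import os

def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files byte by byte, reading in chunks for speed."""
    # Cheap early exit: files of different sizes can never be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached EOF with no mismatch
                return True
```

Memory use stays bounded by the chunk size regardless of file size, and the comparison aborts at the first differing chunk.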

Hashing will also work. Due to the pigeonhole principle there will be a small chance that you'll get false positives, but for all intents and purposes it is negligible if you use a secure hash like SHA. Memory usage is also small, but performance is lower than a byte-by-byte compare because you'll have the overhead of hashing, unless you can reuse the hashes to do multiple compares.
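As a minimal sketch of the hashing approach, assuming SHA-256 as the secure hash (the answer only says "SHA"): hash each file in chunks, then compare digests. Equal digests mean the files are almost certainly identical; unequal digests mean they are definitely different.

```python
import hashlib

def file_digest(path, chunk_size=64 * 1024):
    """Return the hex SHA-256 digest of a file, hashing it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):  # empty bytes at EOF ends the loop
            h.update(chunk)
    return h.hexdigest()

# Usage: file_digest(a) == file_digest(b) compares two files;
# storing digests lets one file be compared against many others cheaply.
```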

多情癖 2024-08-09 02:05:01

Anyway, if your files are n bytes long, you have to compare n bytes; you can't make the problem any simpler.

You can only save on the n comparisons when the files are not identical, for example by checking their lengths first.

A hash is not a proven method because of collisions, and to compute a hash you still have to read all n bytes of each file as well.

If you want to compare the same file multiple times, you can use hashing, then double-check with a byte-by-byte compare.

夜光 2024-08-09 02:05:01

Hashing doesn't output 'unique' values. It can't possibly do so, because there are an infinite number of different files, but only a finite number of hash values. It doesn't take much thought to realise that to be absolutely sure two files are the same, you're going to have to examine all the bytes of both of them.

Hashes and checksums can provide a fast 'these files are different' answer, and within certain probabilistic bounds can provide a fast 'these files are probably the same' answer, but for certainty of equality you have to check every byte. How could there be any way round this?

心的位置 2024-08-09 02:05:01

If you want to compare multiple files, then the SHA-1 hash algorithm is a very good choice.
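For the many-files case this answer has in mind, one common pattern is to hash every file once and group paths by digest, so each group holds likely-identical files; this sketch uses SHA-1 as the answer suggests, with illustrative names:

```python
import hashlib
from collections import defaultdict

def group_by_digest(paths, chunk_size=64 * 1024):
    """Group file paths by SHA-1 digest; files in one group are likely identical."""
    groups = defaultdict(list)
    for path in paths:
        h = hashlib.sha1()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        groups[h.hexdigest()].append(path)
    return dict(groups)
```

With N files this takes N hashing passes instead of the N*(N-1)/2 pairwise byte-by-byte comparisons; as the other answers note, a byte-by-byte check within each group is still needed for absolute certainty.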
