How do I find all files with identical content?

Posted 2024-10-01 03:11:11

This is an interview question: "Given a directory with lots of files, find the files that have the same content". I would propose to use a hash function to generate hash values of the file contents and compare only the files with the same hash values. Does it make sense?

The next question is how to choose the hash function. Would you use SHA-1 for that purpose?
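A minimal sketch of this approach in Python (the function name, directory handling, and choice of SHA-1 are my own illustration, not part of the question): hash every file's contents and report the digest groups containing more than one file.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(directory):
    """Group the files in `directory` by the SHA-1 digest of their contents."""
    groups = defaultdict(list)
    for path in Path(directory).iterdir():
        if path.is_file():
            # read_bytes() loads the whole file into memory; very large
            # files would be hashed in chunks instead.
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Only digests shared by two or more files are potential duplicates.
    return [files for files in groups.values() if len(files) > 1]
```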

4 Comments

纸短情长 2024-10-08 03:11:11

I'd rather use the hash as a second step. Sorting the directory by file size first, and hashing and comparing only when there are duplicate sizes, may greatly narrow your search space in the typical case.
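A sketch of that two-step idea in Python (the names are mine, not the answerer's): bucket files by size using a cheap stat call, then hash only within buckets where sizes collide. Grouping by size is equivalent to sorting for this purpose, and a size that occurs once rules a file out without reading it at all.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates_size_first(directory):
    """Bucket files by size, then hash only inside duplicate-size buckets."""
    by_size = defaultdict(list)
    for path in Path(directory).iterdir():
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a content duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[hashlib.sha1(path.read_bytes()).hexdigest()].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```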

初雪 2024-10-08 03:11:11

Like most interview questions, it's more meant to spark a conversation than to have a single answer.

If there are very few files, it may be faster to simply do a byte-by-byte comparison until you reach bytes which do not match (assuming you do). If there are many files, it may be faster to compute hashes, as you won't have to seek around the disk reading chunks from multiple files. This process may be sped up by grabbing increasingly large chunks of each file as you progress through the files, eliminating candidates. It may also be necessary to distribute the problem among multiple servers, if there are enough files.
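A sketch of the increasingly-large-chunks idea (the starting chunk size and function name are illustrative): split candidate groups by an ever longer prefix, doubling it each round, so mismatched files drop out after a small read and only true duplicates are read to the end.

```python
from collections import defaultdict
from pathlib import Path

def duplicates_by_progressive_prefix(paths):
    """Refine candidate groups by comparing ever larger file prefixes."""
    groups = [list(paths)]
    finished = []
    length = 4096
    while groups:
        next_groups = []
        for group in groups:
            buckets = defaultdict(list)
            for path in group:
                with path.open("rb") as f:
                    buckets[f.read(length)].append(path)
            for bucket in buckets.values():
                if len(bucket) < 2:
                    continue  # prefix diverged: not a duplicate
                if all(p.stat().st_size <= length for p in bucket):
                    finished.append(bucket)  # prefix covered the whole file
                else:
                    next_groups.append(bucket)
        groups = next_groups
        length *= 2
    return finished
```

Re-reading each file from offset 0 every round keeps the sketch short at the cost of redundant IO, and using raw prefixes as dictionary keys trades memory for simplicity; a real implementation would keep file handles open and compare rolling chunks or chunk hashes instead.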

I would begin with a much faster and simpler hash function than SHA-1. SHA-1 is cryptographically secure, which is not necessarily required in this case; in my informal tests, Adler-32, for example, is 2-3 times faster. You could also use an even weaker presumptive test, then retest any files which match. This decision also depends on the relation between IO bandwidth and CPU power: if you have a more powerful CPU, use a more specific hash to save having to reread files in subsequent tests; if you have faster IO, the rereads may be cheaper than computing expensive hashes unnecessarily.
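A sketch of that weak-then-strong strategy, using zlib.adler32 as the fast presumptive filter and a byte-for-byte comparison as the retest (the helper names are mine; the 2-3x figure is the answerer's informal measurement, not something verified here):

```python
import filecmp
import zlib
from collections import defaultdict
from pathlib import Path

def adler32_of(path, chunk_size=1 << 20):
    """Streamed Adler-32 checksum of a file's contents."""
    checksum = 1  # Adler-32 is seeded with 1
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return checksum

def confirmed_duplicates(paths):
    """Filter by a weak checksum, then confirm candidates byte-for-byte."""
    candidates = defaultdict(list)
    for path in paths:
        candidates[adler32_of(path)].append(path)
    groups = []
    for group in candidates.values():
        if len(group) < 2:
            continue
        # Adler-32 collides far more readily than SHA-1, so retest matches.
        matches = [p for p in group[1:]
                   if filecmp.cmp(group[0], p, shallow=False)]
        if matches:
            groups.append([group[0]] + matches)
    return groups
```

Comparing each candidate only against the first member of its group is a simplification; a full pairwise pass would also catch files that match each other but not that first member.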

Another interesting idea would be to use heuristics on the files as you process them to determine the optimal method, based on the file sizes, the computer's speed, and the files' entropy.

沉睡月亮 2024-10-08 03:11:11

Yes, the proposed approach is reasonable, and SHA-1 or MD5 will be enough for that task. Here's a detailed analysis for the very same scenario, and here's a question specifically on using MD5: https://stackoverflow.com/questions/4032209/is-md5-still-good-enough-to-uniquely-identify-files. Don't forget you need a hash function that is as fast as possible.

天暗了我发光 2024-10-08 03:11:11

Yes, hashing is the first thing that comes to mind. For your particular task you need to take the fastest hash function available; Adler-32 would work. Collisions are not a problem in your case, so you don't need a cryptographically strong function.
