Efficient way to tell whether a string/file has changed: CRC32? MD5? Something else?
I'm looking for an efficient way to tell whether a string (or a file) has changed since the last time we looked at it.
So, we run this function against 1,000,000 files/strings (each less than 1000 bytes) and store the output for each one.
I'll then wait a few days and run it again. I need to find out whether each file has changed.
Should I calculate a CRC32 for each file? An MD5? Something else more efficient?
Is CRC32 good enough for telling me whether a file/string has changed?
EDIT: It has to work for both files and strings, so timestamps on the files are out of the question.
Comments (8)
For files, do you have to look at the content? The filesystem tracks a last-modified timestamp.
CRC32 or CRC64 will do the job just fine.
You might even be able to use the checksum as the basis for some sort of hash lookup.
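As a minimal sketch of the CRC32 approach, assuming Java and the JDK's `java.util.zip.CRC32` class: the class name `Crc32Check` and its methods are hypothetical names chosen for illustration. The idea is to store one checksum per file/string on the first run and compare against it on the next run.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper; the names are illustrative, not from the answer above.
class Crc32Check {
    // Returns the CRC32 of the data as an unsigned 32-bit value held in a long.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // True if the checksum of the current content differs from the stored one.
    static boolean hasChanged(long storedChecksum, byte[] current) {
        return checksum(current) != storedChecksum;
    }

    public static void main(String[] args) {
        byte[] before = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] after = "hello World".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(before); // what the first run would save
        System.out.println(hasChanged(stored, before)); // false
        System.out.println(hasChanged(stored, after));  // true
    }
}
```

For files, the byte array could come from `java.nio.file.Files.readAllBytes`, which is reasonable for inputs under 1000 bytes each.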
For the files, you could use the timestamp.
For the strings, you could keep a backup copy.
Just comparing them and rewriting the backup might be as fast as CRC or MD5.
String comparison will be more efficient than crc32, md5, or any other hash algorithm proposed.
For starters, you can bail out of a string comparison as soon as the two strings differ, whereas with a hashing algorithm you have to hash the entire contents of the file before you can make a comparison.
What is more, a hashing algorithm has to perform extra operations to generate the hash, whereas a string comparison only checks two values for equality.
I'd imagine a string-based comparison of the files/strings that short-circuits on the first difference (per file/string) will get you good performance.
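The short-circuiting comparison described above can be sketched as follows, assuming Java and that a backup copy of each file/string is available as a byte array. `DirectCompare` and `hasChanged` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes the previous content was kept as a backup copy.
class DirectCompare {
    // Returns true at the first difference; scans everything only when equal.
    static boolean hasChanged(byte[] backup, byte[] current) {
        if (backup.length != current.length) {
            return true; // different lengths: changed without reading any bytes
        }
        for (int i = 0; i < backup.length; i++) {
            if (backup[i] != current[i]) {
                return true; // stop at the first mismatching byte
            }
        }
        return false; // every byte matched
    }

    public static void main(String[] args) {
        byte[] backup = "some content".getBytes(StandardCharsets.UTF_8);
        byte[] same = "some content".getBytes(StandardCharsets.UTF_8);
        byte[] edited = "Some content".getBytes(StandardCharsets.UTF_8);
        System.out.println(hasChanged(backup, same));   // false
        System.out.println(hasChanged(backup, edited)); // true
    }
}
```

The JDK's `java.util.Arrays.equals(byte[], byte[])` performs the same equality check; the loop is written out here only to make the early exit visible. The trade-off is that a full copy of every item has to be stored, rather than a few bytes of checksum.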
You said the data would be around one million 1 kB strings/files and that you want to check it every few days. If that's true, you really don't have to worry about performance: processing 1 GB of data won't take that long, whether you use crc32 or md5.
I suggest using md5, because it's less likely to collide than crc32. Crc32 will do the job, but you can get a better result without investing much more.
Edit:
As someone else stated, comparing the strings to a backup is faster (because you can abort as soon as two chars differ). This is not 100% true if you have to read the string out of a file. If we assume that the strings come from files and you use md5, you'll have to read 32 bytes (the stored MD5, if kept as a 32-character hex string) plus the string's length, which averages out to 32 bytes plus the average string length for every string you compare. When you compare byte by byte, you'll have to read a minimum of 2 bytes and a maximum of two times the string length. So if many of your strings share a long common beginning (so that the comparison ends up reading more than 32 + the average string length bytes), you'll be faster with a hash. (Correct me if I'm wrong.) Because this is a theoretical case, you'll be fine sticking with a char-by-char comparison. If the average string length is bigger than 32 bytes, you'll save disk space when using a hash ;-).
But as I already stated above, performance won't be your problem when dealing with that amount of data.
In Java you can do:
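One way this might look, as a minimal sketch rather than the answer's own code, assuming the JDK's `java.security.MessageDigest` with the "MD5" algorithm. The class `Md5Check` and its method names are hypothetical and chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper; the names are illustrative assumptions.
class Md5Check {
    // Returns the 16-byte MD5 digest of the data.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True if the digest of the current content differs from the stored digest.
    static boolean hasChanged(byte[] storedDigest, byte[] current) {
        return !Arrays.equals(storedDigest, md5(current));
    }

    // Lowercase hex form of a digest, convenient for storing as text.
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] stored = md5(content);
        System.out.println(toHex(stored)); // 900150983cd24fb0d6963f7d28e17f72
        System.out.println(hasChanged(stored, content)); // false
    }
}
```

The same `md5` method works for files once their bytes are loaded, for example with `java.nio.file.Files.readAllBytes`.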
I use MD5 for this type of thing; it seems to work well enough.
If you're using .NET, see System.Security.Cryptography.MD5CryptoServiceProvider.
For completeness: CRC32 and MD5 may report that a string has not changed when, in fact, it has (because distinct strings exist that share the same CRC32 or MD5).
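To put a rough number on that risk, here is a back-of-the-envelope sketch in Java. It rests on a simplifying assumption that is not from the thread: each changed item's new checksum is treated as independent and uniformly distributed over all possible values, and the changes are accidental rather than crafted to collide. Under that assumption, a changed item goes unnoticed with probability 1/2^bits. `MissedChangeEstimate` is a hypothetical name.

```java
// Hypothetical estimate under the uniform-checksum ASSUMPTION stated above.
class MissedChangeEstimate {
    // Expected number of changed items whose checksum happens to stay the same.
    static double expectedMisses(long changedItems, int checksumBits) {
        return changedItems / Math.pow(2.0, checksumBits);
    }

    public static void main(String[] args) {
        // Worst case for the question's workload: all 1,000,000 items changed.
        System.out.println(expectedMisses(1_000_000L, 32));  // CRC32: roughly 2.3e-4
        System.out.println(expectedMisses(1_000_000L, 128)); // MD5: roughly 2.9e-33
    }
}
```

Read this way, a full run over a million changed items would be expected to miss a change about once in several thousand runs with a 32-bit checksum, and essentially never with a 128-bit digest. Real checksums are not perfectly uniform, so these figures are only an order-of-magnitude guide.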