Compare files byte by byte or read all bytes?
I came across this code, http://support.microsoft.com/kb/320348, which made me wonder what the best way is to compare two files to find out whether they differ.
The main idea is to optimize my program, which needs to check whether files are equal in order to build a list of changed files and/or files to delete or create.
Currently I compare the sizes of the files and, if they match, compute an MD5 checksum of each file. After looking at the code linked at the beginning of this question, I wonder whether that byte-by-byte comparison is really worth using instead of creating a checksum of the two files (which also means reading all the bytes of both).
Also, what other checks should I make to reduce the work of checking each file?
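For reference, the approach described in the question (size check first, then MD5 of both files) might look roughly like the sketch below. It is written in Python for brevity rather than in the language of the linked article, and the names `md5_of_file` and `same_by_size_then_md5` as well as the 64 KB chunk size are made up for illustration.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=64 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_by_size_then_md5(path_a, path_b):
    """Cheap size check first; hash both files only when the sizes match."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Note that when the sizes match, this always reads both files to the end, even if they differ in the first byte.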
Comments (2)
Read both files in small blocks (4K or 8K), a size that suits efficient reads, and then compare the blocks in memory (byte by byte), which is fast.
This should perform well in every case, whether the difference is at the start, in the middle, or at the end.
Of course, the first step is to check whether the file lengths differ; if they do, the files are definitely different.
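The steps above can be sketched as follows. This is a minimal illustration in Python, not the code from the linked article; the function name `files_equal` and the default 8 KB block size are assumptions chosen to match the sizes mentioned in this answer.

```python
import os


def files_equal(path_a, path_b, chunk_size=8 * 1024):
    """Compare two files block by block, stopping at the first difference."""
    # Step 1: files of different length cannot be equal, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False

    # Step 2: read both files in fixed-size blocks and compare them in memory.
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first mismatching block ends the comparison
            if not chunk_a:
                return True   # both files hit end-of-file with no mismatch
```

Comparing whole `bytes` blocks with `!=` still checks the data byte by byte, but the loop over individual bytes runs inside the interpreter's native code rather than in Python.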
If you haven't already computed hashes of the files, then you might as well do a proper comparison (instead of looking at hashes): if the files are the same it's about the same amount of work, since every byte has to be read either way, but if they're different you can stop much earlier.
Of course, comparing a byte at a time is probably a bit wasteful. It's probably a good idea to read whole blocks at a time and compare them.
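As one possible illustration of such a direct comparison, Python's standard library already provides it as `filecmp.cmp`. The wrapper name `contents_match` below is hypothetical, and this shows only one language's built-in; it is not what the linked article uses.

```python
import filecmp


def contents_match(path_a, path_b):
    """Return True when the two files have identical contents."""
    # shallow=False asks for a comparison of the file contents instead of
    # accepting matching os.stat() metadata (type, size, mtime) as proof.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

`filecmp` caches results, so a long-running program that re-checks files it may have modified can call `filecmp.clear_cache()` between runs.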