Speeding up file comparisons (with "cmp") on Cygwin?
I've written a bash script on Cygwin which is rather like rsync, although different enough that I believe I can't actually use rsync for what I need. It iterates over about a thousand pairs of files in corresponding directories, comparing them with cmp.
Unfortunately, this seems to run abysmally slowly -- taking about ten (Edit: actually 25!) times as long as it takes to generate one of the sets of files using a Python program.
Am I right in thinking that this is surprisingly slow? Are there any simple alternatives that would go faster?
(To elaborate a bit on my use-case: I am autogenerating a bunch of .c files in a temporary directory, and when I re-generate them, I'd like to copy only the ones that have changed into the actual source directory, leaving the unchanged ones untouched (with their old modification times) so that make will know that it doesn't need to recompile them. Not all the generated files are .c files, though, so I need to do binary comparisons rather than text comparisons.)
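Schematically, the loop looks something like this (a sketch, not the actual script; the directory names are illustrative):

    #!/bin/bash
    # Sketch of the compare-and-copy loop (illustrative paths).
    for new in build-tmp/*; do
        old="src/${new##*/}"
        # cmp exits 0 when the files are identical; copy on any
        # difference (or when the old file does not exist yet).
        if ! cmp "$new" "$old" >/dev/null 2>&1; then
            cp "$new" "$old"
        fi
    done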
Maybe you should use Python to do some - or even all - of the comparison work too?

One improvement would be to only bother running cmp if the file sizes are the same; if they're different, clearly the file has changed. Instead of running cmp, you could think about generating a hash for each file, using MD5 or SHA1 or SHA-256 or whatever takes your fancy (using Python modules or extensions, if that's the correct term). If you don't think you'll be dealing with malicious intent, then MD5 is probably sufficient to identify differences.

Even in a shell script, you could run an external hashing command, and give it the names of all the files in one directory, then give it the names of all the files in the other directory. Then you can read the two sets of hash values plus file names and decide which have changed.
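For example, a sketch of the shell-script variant, assuming GNU md5sum (which ships with Cygwin) and illustrative directory names:

    # Hash every file once per directory: two md5sum processes
    # instead of a thousand cmp processes.
    ( cd build-tmp && md5sum * | sort ) > new.md5
    ( cd src       && md5sum * | sort ) > old.md5
    # Hash+name lines unique to new.md5 are changed (or new) files;
    # print their names (assumes filenames without spaces).
    comm -13 old.md5 new.md5 | awk '{ print $2 }'

Keeping new.md5 around to serve as the next run's old.md5 also gives you the hash-caching idea described below.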
Yes, it does sound like it is taking too long. But the trouble includes having to launch 1000 copies of cmp, plus the other processing. Both the Python and the shell script suggestions above have in common that they avoid running a program 1000 times; they try to minimize the number of programs executed. This reduction in the number of processes executed will give you a pretty big bang for your buck, I expect.

If you can keep the hashes from 'the current set of files' around and simply generate new hashes for the new set of files, and then compare them, you will do well. Clearly, if the file containing the 'old hashes' (current set of files) is missing, you'll have to regenerate it from the existing files. This is slightly fleshing out information in the comments.
One other possibility: can you track changes in the data that you use to generate these files, and use that to tell you which files will have changed (or, at least, to limit the set of files that may have changed and that therefore need to be compared, since your comments indicate that most files are the same each time)?
If you can reasonably do the comparison of a thousand odd files within one process rather than spawning and executing a thousand additional programs, that would probably be ideal.
The short answer: Add --silent to your cmp call, if it isn't there already.

You might be able to speed up the Python version by doing some file size checks before checking the data.
First, a quick-and-hacky bash(1) technique that might be far easier if you can change to a single build directory: use the bash -N test:
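A minimal sketch of the -N test (build/*.c is an illustrative path, not part of the original suggestion):

    # The bash test -N is true if a file has been modified since it
    # was last read (mtime newer than atime); note that it relies on
    # access times being recorded, so a noatime mount defeats it.
    for f in build/*.c; do
        if [ -N "$f" ]; then
            echo "$f has been modified since it was last read"
        fi
    done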
test:Of course, if some subset of the files depend upon some other subset of the generated files, this approach won't work at all. (This might be reason enough to avoid this technique; it's up to you.)
Within your Python program, you could also check the file sizes using os.stat() to determine whether or not you should call your comparison routine; if the files are different sizes, you don't really care which bytes changed, so you can skip reading both files. (This would be difficult to do in bash(1) -- I know of no mechanism to get the file size in bash(1) without executing another program, which defeats the whole point of this check.)
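A sketch of that check in Python (the function name and the filecmp fallback are illustrative choices, not part of the original suggestion):

    import filecmp
    import os

    def needs_copy(new_path, old_path):
        """Cheap size check first; read contents only when sizes match."""
        try:
            # Different sizes: the file has certainly changed.
            if os.stat(new_path).st_size != os.stat(old_path).st_size:
                return True
        except OSError:
            return True  # old file is missing, so it must be copied
        # Same size: fall back to a byte-by-byte comparison.
        return not filecmp.cmp(new_path, old_path, shallow=False)

Here filecmp.cmp with shallow=False simply stands in for whatever content-comparison routine you already have.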
The cmp program will do the size comparison internally IFF you are using the --silent flag and both files are regular files and both files are positioned at the same place. (This is set via the --ignore-initial flag.) If you're not using --silent, add it and see what the difference is.
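For example (the paths are illustrative):

    # cmp exit status: 0 = identical, 1 = different, 2 = trouble.
    # With --silent (short form: -s), cmp can stop on a size mismatch
    # without reading either file's contents.
    if cmp --silent build-tmp/foo.c src/foo.c; then
        echo "unchanged"
    else
        echo "changed (or missing)"
    fi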