Speeding up file comparisons (with "cmp") on Cygwin?
I've written a bash script on Cygwin which is rather like rsync, although different enough that I believe I can't actually use rsync for what I need. It iterates over about a thousand pairs of files in corresponding directories, comparing them with cmp.
Unfortunately, this seems to run abysmally slowly -- taking about ten (Edit: actually 25!) times as long as it takes to generate one of the sets of files using a Python program.
Am I right in thinking that this is surprisingly slow? Are there any simple alternatives that would go faster?
(To elaborate a bit on my use-case: I am autogenerating a bunch of .c files in a temporary directory, and when I re-generate them, I'd like to copy only the ones that have changed into the actual source directory, leaving the unchanged ones untouched (with their old modification times) so that make will know that it doesn't need to recompile them. Not all the generated files are .c files, though, so I need to do binary comparisons rather than text comparisons.)
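Schematically, the loop looks something like this (a sketch, not the actual script; the directory names are illustrative):

    #!/bin/bash
    # Sketch of the compare-and-copy loop (illustrative paths).
    for new in build-tmp/*; do
        old="src/${new##*/}"
        # cmp exits 0 when the files are identical; copy on any
        # difference (or when the old file does not exist yet).
        if ! cmp "$new" "$old" >/dev/null 2>&1; then
            cp "$new" "$old"
        fi
    done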
Maybe you should use Python to do some - or even all - of the comparison work too?

One improvement would be to only bother running cmp if the file sizes are the same; if they're different, clearly the file has changed. Instead of running cmp, you could think about generating a hash for each file, using MD5 or SHA1 or SHA-256 or whatever takes your fancy (using Python modules or extensions, if that's the correct term). If you don't think you'll be dealing with malicious intent, then MD5 is probably sufficient to identify differences.

Even in a shell script, you could run an external hashing command, and give it the names of all the files in one directory, then give it the names of all the files in the other directory. Then you can read the two sets of hash values plus file names and decide which have changed.
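For example, a sketch of the shell-script variant, assuming GNU md5sum (which ships with Cygwin) and illustrative directory names:

    # Hash every file once per directory: two md5sum processes
    # instead of a thousand cmp processes.
    ( cd build-tmp && md5sum * | sort ) > new.md5
    ( cd src       && md5sum * | sort ) > old.md5
    # Hash+name lines unique to new.md5 are changed (or new) files;
    # print their names (assumes filenames without spaces).
    comm -13 old.md5 new.md5 | awk '{ print $2 }'

Keeping new.md5 around to serve as the next run's old.md5 also gives you the hash-caching idea described below.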
Yes, it does sound like it is taking too long. But the trouble includes having to launch 1000 copies of cmp, plus the other processing. Both the Python and the shell script suggestions above have in common that they avoid running a program 1000 times; they try to minimize the number of programs executed. This reduction in the number of processes executed will give you a pretty big bang for your buck, I expect.

If you can keep the hashes from 'the current set of files' around and simply generate new hashes for the new set of files, and then compare them, you will do well. Clearly, if the file containing the 'old hashes' (current set of files) is missing, you'll have to regenerate it from the existing files. This is slightly fleshing out information in the comments.
One other possibility: can you track changes in the data that you use to generate these files, and use that to tell you which files will have changed (or, at least, to limit the set of files that may have changed and that therefore need to be compared, since your comments indicate that most files are the same each time)?
If you can reasonably do the comparison of a thousand odd files within one process rather than spawning and executing a thousand additional programs, that would probably be ideal.
The short answer: Add --silent to your cmp call, if it isn't there already.

You might be able to speed up the Python version by doing some file size checks before checking the data.
First, a quick-and-hacky bash(1) technique that might be far easier if you can change to a single build directory: use the bash -N test:
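A minimal sketch of the -N test (build/*.c is an illustrative path, not part of the original suggestion):

    # The bash test -N is true if a file has been modified since it
    # was last read (mtime newer than atime); note that it relies on
    # access times being recorded, so a noatime mount defeats it.
    for f in build/*.c; do
        if [ -N "$f" ]; then
            echo "$f has been modified since it was last read"
        fi
    done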
test:Of course, if some subset of the files depend upon some other subset of the generated files, this approach won't work at all. (This might be reason enough to avoid this technique; it's up to you.)
Within your Python program, you could also check the file sizes using os.stat() to determine whether or not you should call your comparison routine; if the files are different sizes, you don't really care which bytes changed, so you can skip reading both files. (This would be difficult to do in bash(1) -- I know of no mechanism to get the file size in bash(1) without executing another program, which defeats the whole point of this check.)
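A sketch of that check in Python (the function name and the filecmp fallback are illustrative choices, not part of the original suggestion):

    import filecmp
    import os

    def needs_copy(new_path, old_path):
        """Cheap size check first; read contents only when sizes match."""
        try:
            # Different sizes: the file has certainly changed.
            if os.stat(new_path).st_size != os.stat(old_path).st_size:
                return True
        except OSError:
            return True  # old file is missing, so it must be copied
        # Same size: fall back to a byte-by-byte comparison.
        return not filecmp.cmp(new_path, old_path, shallow=False)

Here filecmp.cmp with shallow=False simply stands in for whatever content-comparison routine you already have.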
The cmp program will do the size comparison internally IFF you are using the --silent flag and both files are regular files and both files are positioned at the same place. (This is set via the --ignore-initial flag.) If you're not using --silent, add it and see what the difference is.
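For example (the paths are illustrative):

    # cmp exit status: 0 = identical, 1 = different, 2 = trouble.
    # With --silent (short form: -s), cmp can stop on a size mismatch
    # without reading either file's contents.
    if cmp --silent build-tmp/foo.c src/foo.c; then
        echo "unchanged"
    else
        echo "changed (or missing)"
    fi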