diff files, comparing only the first n characters of each line

Posted on 2024-11-08 06:03:00

I have two files; let us call them md5s1.txt and md5s2.txt. Each contains the output of the command

find -type f -print0 | xargs -0 md5sum | sort > md5s.txt

run in a different directory. Many files were renamed, but their content stayed the same, so they should have the same md5sum. I want to generate a diff like

diff md5s1.txt md5s2.txt

but it should compare only the first 32 characters of each line, i.e. only the md5sum, not the filename. Lines with equal md5sums should be considered equal. The output should be in normal diff format.

初心 2024-11-15 06:03:00

Easy starter:

diff <(cut -d' ' -f1 md5s1.txt)  <(cut -d' ' -f1 md5s2.txt)
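To see what this prints, here is a minimal sketch on made-up data. The listings, file names, and the `hash_diff` variable are assumptions for illustration only. It writes the checksum columns to temporary files instead of using process substitution, so it also runs in a plain POSIX sh; the comparison itself is the same.

```shell
#!/bin/sh
# Sample data only: two made-up listings in `md5sum | sort` format.
tmp=$(mktemp -d)
cat > "$tmp/md5s1.txt" <<'EOF'
0cc175b9c0f1b6a831c399e269772661  ./old-name.txt
4a8a08f09d37b73795649038408b5f33  ./removed.txt
92eb5ffee6ae2fec3ad71c777531578f  ./same.txt
EOF
cat > "$tmp/md5s2.txt" <<'EOF'
0cc175b9c0f1b6a831c399e269772661  ./new-name.txt
8277e0910d750195b448797616e091ad  ./added.txt
92eb5ffee6ae2fec3ad71c777531578f  ./same.txt
EOF

# Keep only the checksum column (field 1) of each listing.
cut -d' ' -f1 "$tmp/md5s1.txt" > "$tmp/sums1"
cut -d' ' -f1 "$tmp/md5s2.txt" > "$tmp/sums2"

# diff exits with status 1 when the inputs differ; that is not an error here.
hash_diff=$(diff "$tmp/sums1" "$tmp/sums2") || true
printf '%s\n' "$hash_diff"

rm -rf "$tmp"
```

On this sample the renamed file produces no output at all, and the only hunk is `2c2` with the two checksums that differ. Note that the file names are gone from the output, because only the checksum column was compared.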

Also, consider just

diff -EwburqN folder1/ folder2/
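One caveat worth checking before relying on the directory comparison: a recursive diff pairs files by name, so a renamed file is reported even when its content is unchanged. The sketch below illustrates this on throwaway directories; the directory contents and the `dir_report` variable are assumptions for illustration, and the flags assume GNU diff.

```shell
#!/bin/sh
# Sample data only: one file renamed between the folders, one file untouched.
tmp=$(mktemp -d)
mkdir "$tmp/folder1" "$tmp/folder2"
printf 'a\n' > "$tmp/folder1/old-name.txt"
printf 'a\n' > "$tmp/folder2/new-name.txt"   # same content, new name
printf 'b\n' > "$tmp/folder1/same.txt"
printf 'b\n' > "$tmp/folder2/same.txt"

# -r recurses, -q only reports whether files differ, -N treats a missing
# file as empty; -E, -w and -b ignore various whitespace changes.
# diff exits with status 1 when it finds differences.
dir_report=$(cd "$tmp" && diff -EwburqN folder1/ folder2/) || true
printf '%s\n' "$dir_report"

rm -rf "$tmp"
```

Both names of the renamed file show up in the report while same.txt does not, so this form suits a check for changed content under stable names rather than rename detection.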

も星光 2024-11-15 06:03:00

Compare only the md5 column by running diff on <(cut -c -32 md5sums.sort.XXX), and tell diff to print just the line numbers of added or removed lines, using --old/new-line-format='%dn'$'\n'. Pipe those numbers into ed -s md5sums.sort.XXX, which then prints only those lines of the md5sums.sort.XXX file (-s suppresses ed's byte-count message, which would otherwise end up in the output).

diff \
    --new-line-format='%dn'$'\n' \
    --old-line-format='' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.new \
    > files-added
diff \
    --new-line-format='' \
    --old-line-format='%dn'$'\n' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.old \
    > files-removed

The problem with ed is that it loads the entire file into memory, which can be a problem if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following command, which uses much less memory.

diff … | (
    lnum=0
    while read -r lprint; do
        # Advance through the listing on fd 3 until line number lprint is reached.
        while [ "$lnum" -lt "$lprint" ]; do IFS= read -r line <&3; ((lnum++)); done
        # Quote the line so the spacing between checksum and file name is preserved.
        printf '%s\n' "$line"
    done
) 3<md5sums.sort.XXX
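For an end-to-end check of the line-number idea, here is a minimal sketch on made-up data. It is a variation, not the loop from this answer: the selected lines are printed with awk, which keeps only the line numbers in memory and streams the listing. The sample listings and the variable names (`added_nums`, `files_added`, and so on) are assumptions for illustration, the line-format options assume GNU diff, and temporary files stand in for process substitution so it runs in plain sh.

```shell
#!/bin/sh
# Sample data only: two made-up listings in `md5sum | sort` format.
tmp=$(mktemp -d)
cat > "$tmp/md5sums.sort.old" <<'EOF'
0cc175b9c0f1b6a831c399e269772661  ./old-name.txt
4a8a08f09d37b73795649038408b5f33  ./removed.txt
92eb5ffee6ae2fec3ad71c777531578f  ./same.txt
EOF
cat > "$tmp/md5sums.sort.new" <<'EOF'
0cc175b9c0f1b6a831c399e269772661  ./new-name.txt
8277e0910d750195b448797616e091ad  ./added.txt
92eb5ffee6ae2fec3ad71c777531578f  ./same.txt
EOF

# Checksum columns only (first 32 characters of each line).
cut -c 1-32 "$tmp/md5sums.sort.old" > "$tmp/old.sums"
cut -c 1-32 "$tmp/md5sums.sort.new" > "$tmp/new.sums"

# diff does not expand \n in its format strings, so pass a literal newline.
nl='
'
# Line numbers (in the new file) of added lines; diff exits 1 on differences.
added_nums=$(diff --new-line-format="%dn$nl" --old-line-format='' \
    --unchanged-line-format='' "$tmp/old.sums" "$tmp/new.sums") || true
# Line numbers (in the old file) of removed lines.
removed_nums=$(diff --new-line-format='' --old-line-format="%dn$nl" \
    --unchanged-line-format='' "$tmp/old.sums" "$tmp/new.sums") || true

# awk first reads the wanted line numbers from stdin ("-"), then streams the
# listing and prints only the lines whose number was requested.
files_added=$(printf '%s\n' "$added_nums" |
    awk 'NR == FNR { want[$1]; next } FNR in want' - "$tmp/md5sums.sort.new")
files_removed=$(printf '%s\n' "$removed_nums" |
    awk 'NR == FNR { want[$1]; next } FNR in want' - "$tmp/md5sums.sort.old")

printf 'added:   %s\n' "$files_added"
printf 'removed: %s\n' "$files_removed"
rm -rf "$tmp"
```

On this sample the added entry is the ./added.txt line and the removed entry is the ./removed.txt line, each printed in full with its file name; the renamed file appears in neither list.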
