Comparing on strings rather than on lines

Posted 2025-01-07 16:31:28

I feel I should be able to do this in my sleep, but let's say I have two text files, each of which has a single column of Apache module names in no particular order. One file has 46 strings, each unique within that file. The other has 67 lines and 67 strings, also unique within that file. There will be many strings in common.

I need to find the names of apache modules that are -not- in the shorter, first file but -are- in the second, longer file.

I want to do this by searching and comparing strings. Line number, order, and position are completely irrelevant. I just want to know which modules are listed only in the longer file and therefore need to be installed.

By default, uniq, comm, and diff want to work by lines and line numbers. I don't want a side-by-side comparison; I just want a list.


Comments (2)

风铃鹿 2025-01-14 16:31:28

Break your strings into lines, sort and uniqify them, and use comm for the analysis. (See BashFAQ #36).

I'm going to assume, to have an example, that you want to compare the LoadModule directives between two Apache config files.

file1:

...other stuff...
LoadModule foo modules/foo.so
LoadModule bar modules/bar.so
LoadModule baz modules/baz.so
...other stuff...

file2:

...other stuff...
LoadModule foo modules/foo.so
...other stuff...

So, to do this:

comm -2 -3 \
  <(gawk '/LoadModule/ { print $2 }' file1 | sort -u) \
  <(gawk '/LoadModule/ { print $2 }' file2 | sort -u)

...will suppress any lines found only in the second (shorter) file or in both files, and give you the module names found only in the first, yielding the following output:

bar
baz
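
For the simpler case in the question, where each file already holds one module name per line, the gawk step could be dropped. The following is only a sketch under that assumption; the function name only_in_second and the file names in the usage comment are hypothetical. Here -1 and -3 suppress the lines unique to the first file and the lines common to both, leaving what appears only in the second.

```shell
#!/bin/bash
# Sketch: print the lines that appear only in the second file.
# only_in_second is a hypothetical helper name, not part of comm.
only_in_second() {
    # comm expects sorted input; LC_ALL=C keeps sort and comm
    # using the same byte-wise ordering.
    LC_ALL=C comm -13 <(LC_ALL=C sort -u "$1") <(LC_ALL=C sort -u "$2")
}

# Usage (placeholder file names): shorter list first, longer list second.
# only_in_second shorter.txt longer.txt
```

Passing the shorter file first matches the order described in the question; swapping the arguments lists what is only in the shorter file instead.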

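For folks looking at this question with more interesting use cases in mind: unfortunately, while GNU sort's -z flag can handle NUL delimiters (to allow comparison of strings containing newlines), comm historically could not (recent GNU coreutils releases do give comm a -z option as well). However, you can write your own comm-like tool in shell which supports NUL delimiters. The following example takes two sorted, NUL-delimited files and prints the entries common to both, the equivalent of comm -12: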

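#!/bin/bash
# Both inputs must be NUL-delimited and sorted (for example with sort -z).
exec 3<"$1" 4<"$2"

# Prime the comparison with the first entry of the second file.
IFS='' read -u 4 -d ''; input_two="$REPLY"

while IFS='' read -u 3 -d '' ; do
    input_one="$REPLY"
    # Advance the second file until it catches up with the first.
    while [[ $input_two < $input_one ]] ; do
        IFS='' read -u 4 -d '' || exit 0
        input_two="$REPLY"
    done
    # Emit entries present in both files, NUL-terminated.
    if [[ $input_two = "$input_one" ]] ; then
        printf '%s\0' "$input_two"
    fi
done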
甜中书 2025-01-14 16:31:28

I would run a little bash script like this (differ.bash):

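#!/bin/bash
f1=$1 # longer file
f2=$2 # shorter file

# Word splitting on the cat output is acceptable here only because
# each module name is a single word.
for item in $(cat "$f1")
do
    match=0
    for other in $(cat "$f2")
    do
        if [ "$item" = "$other" ]
        then
            match=1
            break
        fi
    done
    if [ "$match" != 1 ]
    then
        echo "$item"
    fi
done

exit 0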

Run it like so, with the longer file as the first argument:

$ ./differ.bash file1 file2

Basically, I am just setting up a double for loop with the longer file on the outer loop and the shorter file on the inner loop. That way each item in the longer list gets compared with the items in the shorter list. This allows us to find all the items that don't match something in the smaller list.
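
The same comparison could also be sketched without explicit loops by letting grep treat every line of the shorter file as a fixed-string pattern. This is an alternative under the assumption that each file holds one module name per line; the function name not_in_shorter is hypothetical.

```shell
#!/bin/bash
# Sketch: print the lines of the longer file that match no line
# of the shorter file. not_in_shorter is a hypothetical helper name.
not_in_shorter() {
    local longer=$1 shorter=$2
    # -F  treat patterns as fixed strings, not regular expressions
    # -x  a pattern must match the whole line
    # -v  print the lines that do NOT match
    # -f  read the patterns from the shorter file
    grep -F -x -v -f "$shorter" "$longer"
}

# Usage (placeholder file names), longer file first as in differ.bash:
# not_in_shorter longer.txt shorter.txt
```

Unlike the comm approach, this keeps the longer file's original order and needs no sorting. Note that grep exits with status 1 when it prints nothing.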


Edit: I have tried to address Charles' first comment with this updated script:

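#!/bin/bash
f1=$1 # longer file
f2=$2 # shorter file

# Load the shorter file into an array, one line per element.
others=()
while IFS= read -r item
do
    others+=( "$item" )
done < "$f2"

# Print each line of the longer file that matches nothing in the array.
while IFS= read -r item
do
    match=0
    # "${others[@]}" expands to every element; plain $others is only the first.
    for other in "${others[@]}"
    do
        if [ "$item" = "$other" ]
        then
            match=1
            break
        fi
    done
    if [ "$match" != 1 ]
    then
        echo "$item"
    fi
done < "$f1"

exit 0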
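
As a further sketch, the inner loop could be replaced by a lookup table, so each line of the longer file is checked once instead of being compared against every entry. This assumes bash 4 or later (for associative arrays) and one module name per line; the function name missing_from_shorter is hypothetical.

```shell
#!/bin/bash
# Sketch (assumes bash 4+): set difference using an associative array.
# missing_from_shorter is a hypothetical helper name.
missing_from_shorter() {
    local longer=$1 shorter=$2 item
    local -A seen=()

    # Record every non-empty line of the shorter file as a key.
    while IFS= read -r item
    do
        if [[ -n $item ]]
        then
            seen["$item"]=1
        fi
    done < "$shorter"

    # Print each non-empty line of the longer file that was never recorded.
    while IFS= read -r item
    do
        if [[ -n $item && -z ${seen["$item"]+x} ]]
        then
            printf '%s\n' "$item"
        fi
    done < "$longer"
    return 0
}

# Usage (placeholder file names), longer file first as in differ.bash:
# missing_from_shorter longer.txt shorter.txt
```

For lists of 46 and 67 entries the nested loop is fast enough; the lookup table mainly matters for much larger files.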