Fast way of finding lines in one file that are not in another?

I have two large files (sets of filenames). Roughly 30.000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.

For example, if this is file1:

line1
line2
line3

And this is file2:

line1
line4
line5

Then my result/output should be:

line2
line3

This works:

grep -v -f file2 file1

But it is very, very slow when used on my large files.

I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.

Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?

EDIT: To follow up on my own question, this is the best way I have found so far using diff:

diff file2 file1 | grep '^>' | sed 's/^>\ //'

Surely, there must be a better way?

13 Answers

撩心不撩汉 2025-02-18 01:24:31

The comm command (short for "common") may be useful: comm - compare two sorted files line by line.

#find lines only in file1
comm -23 file1 file2 

#find lines only in file2
comm -13 file1 file2 

#find lines common to both files
comm -12 file1 file2 

The man file is actually quite readable for this.
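
Note that comm expects both inputs to be sorted. A minimal sketch, assuming bash process substitution is available, for files that are not already sorted:

# lines only in file1, sorting both inputs on the fly
comm -23 <(sort file1) <(sort file2)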

梦幻的味道 2025-02-18 01:24:31

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format=""  file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above new and unchanged lines are suppressed, so only changed (i.e. removed lines in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w, etc.) for less strict matching.


Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
     --new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+" "-" or " ", like diff -u
(note that it only outputs differences, it lacks the --- +++ and @@ lines at the top of each grouped change).
You can also use this to do other useful things like number each line with %dn.
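
For example, a sketch using the %dn specifier mentioned above to prefix each removed line with its line number in file1 (GNU diff only):

diff --new-line-format="" --unchanged-line-format="" \
     --old-line-format="%dn: %L" file1 file2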


The diff method (along with other suggestions comm and join) only produce the expected output with sorted input, though you can use <(sort ...) to sort in place. Here's a simple awk (nawk) script (inspired by the scripts linked-to in Konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" }                         # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will have different output to the diff method if there are duplicates.)
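
For example, if the script above is saved as linesnotin.awk (the filename assumed by the split example further down), it could be run as:

awk -f linesnotin.awk file1 file2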

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) {  # file1, index by lineno and string
  ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {  # file2
  if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
  for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}

The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.

In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), repeated runs with chunks of file1 and reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of - meaning stdin on the gawk command line. This is provided by split from file1 in chunks of 20000 lines per invocation.

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OS X as part of the Apple Xcode tools, which provides GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.

别想她 2025-02-18 01:24:31

Like konsolebox suggested, the posters grep solution

grep -v -f file2 file1

actually works great (faster) if you simply add the -F option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000-line file lists I had to compare. With -F it took 0.031 s (real), while without it, it took 2.278 s (real), when redirecting the grep output to wc -l.

These tests also included the -x switch, which is a necessary part of the solution to ensure total accuracy in cases where file2 contains lines which match part of, but not all of, one or more lines in file1.

So a solution that does not require the inputs to be sorted, is fast, and is flexible (case sensitivity, etc.) is:

grep -F -x -v -f file2 file1

This doesn't work with all versions of grep; for example, it fails on macOS, where a line in file1 will be shown as not present in file2, even though it is, if it matches another line that is a substring of it. Alternatively, you can install GNU grep on macOS in order to use this solution.
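
As a sketch of that last suggestion, assuming Homebrew is used (its grep formula typically installs the GNU version under a g prefix, as ggrep):

brew install grep
ggrep -F -x -v -f file2 file1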

凉栀 2025-02-18 01:24:31

If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:

cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

Test:

seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

# Output:
1
2    

This is also relatively fast, compared to grep.
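
One caveat: this assumes each line of includes.txt appears there exactly once. If includes.txt may itself contain duplicate lines, a sketch that deduplicates it first (assuming bash process substitution):

cat <(sort -u includes.txt) excludes.txt excludes.txt | sort | uniq --unique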

苏辞 2025-02-18 01:24:31

Use combine from the moreutils package, a sets utility that supports not, and, or, and xor operations:

combine file1 not file2

i.e. give me lines that are in file1 but not in file2

OR give me lines in file1 minus lines in file2

Note: combine sorts and finds unique lines in both files before performing any operation but diff does not. So you might find differences between output of diff and combine.

So in effect you are saying

Find distinct lines in file1 and file2 and then give me lines in file1 minus lines in file2

In my experience, it's much faster than other options

意犹 2025-02-18 01:24:31

What's the speed of sort and diff?

sort file1 -u > file1.sorted
sort file2 -u > file2.sorted
diff file1.sorted file2.sorted
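
If only the lines missing from file2 are wanted, rather than the full diff output, a sketch that combines this with the question's own grep/sed filtering:

sort -u file1 > file1.sorted
sort -u file2 > file2.sorted
diff file2.sorted file1.sorted | grep '^>' | sed 's/^> //'
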
木槿暧夏七纪年 2025-02-18 01:24:31

This seems quick for me:

comm -1 -3 <(sort file1.txt) <(sort file2.txt) > output.txt
妥活 2025-02-18 01:24:31

$ join -v 1 -t '' file1 file2
line2
line3

The -t '' makes sure that it compares the whole line, in case some of the lines contain a space.
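
Note that join also expects sorted input; a sketch, assuming bash process substitution, for files that are not already sorted:

join -v 1 -t '' <(sort file1) <(sort file2)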

悲欢浪云 2025-02-18 01:24:31

You can use Python:

python -c '
lines_to_remove = set()
with open("file2", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())

with open("f1", "r") as f:
    for line in f.readlines():
        if line.strip() not in lines_to_remove:
            print(line.strip())
'
末骤雨初歇 2025-02-18 01:24:31

Using fgrep or adding the -F option to grep could help. But for faster processing you could use awk.

You could try one of these Awk methods:

http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219
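
For reference, one common awk method of this kind (not necessarily the one behind the link) is a hash-lookup one-liner; a sketch:

# read file2 first to build the set of lines to exclude,
# then print only file1 lines not in that set
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' file2 file1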

风苍溪 2025-02-18 01:24:31

The way I usually do this is using the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format.

diff -y --suppress-common-lines file1.txt file2.txt

秋叶绚丽 2025-02-18 01:24:31

An awk-based solution that does not require any sorting:

awk '! (__[$+_] += NR == FNR)' FS='\n' file2 file1 
幸福还没到 2025-02-18 01:24:31

To test handling regexp metachars, substrings, spaces, case-insensitivity, otherwise-duplicate strings that only vary in case, and more than 2 input files, we need some different sample input:

$ head input1 input2 input3
==> input1 <==
aAa
bBb
a.a
aa
BbB
AaA
ccc
ddd
E e E e

==> input2 <==
aaa
bb
b.b

==> input3 <==
ccc
dddd
e e e e

and then we can do this, using any grep:

$ grep -iFxv -f input2 -f input3 input1
bBb
a.a
aa
BbB
ddd

(or if you prefer grep -iFxvf <(cat input2 input3) input1) or any awk:

$ awk '
    { lc = tolower($0) }
    NR==FNR { a[lc] = ( lc in a ? a[lc] ORS : "" ) $0; next }
    { delete a[lc] }
    END { for ( lc in a ) print a[lc] }
' input1 input2 input3
a.a
bBb
BbB
ddd
aa

This was originally posted as an answer to How to print the lines that do not occur in another file? but was moved here, as that question seems likely to be closed as a dup of this one, and the above addresses cases that, as best I can tell, are not otherwise covered by the existing answers here.
