I have two large files (sets of filenames), with roughly 30,000 lines in each. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
line1
line2
line3
And this is file2:
line1
line4
line5
Then my result/output should be:
line2
line3
This works:
grep -v -f file2 file1
But it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff:
diff file2 file1 | grep '^>' | sed 's/^>\ //'
Surely, there must be a better way?
The comm command (short for "common") may be useful:
comm - compare two sorted files line by line
The man page is actually quite readable for this.
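For the example in the question, a minimal sketch with comm might look like the following. It assumes both files are first sorted into temporary copies (the scratch directory and the .sorted names are illustrative), and it uses -2 and -3 to suppress the lines unique to file2 and the lines common to both.

```shell
# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmp/file1"
printf 'line1\nline4\nline5\n' > "$tmp/file2"

# comm needs sorted input; LC_ALL=C keeps sort and comm on the same collation.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -2 drops lines only in file2, -3 drops lines in both,
# so only the lines unique to file1 remain.
only_in_file1=$(LC_ALL=C comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$only_in_file1"

rm -rf "$tmp"
```

In bash, process substitution such as comm -23 <(sort file1) <(sort file2) would avoid the temporary files.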
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output, setting the formats for new and unchanged lines to empty strings.

The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ).

With those settings, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc.) for less strict matching.

Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it by giving each kind of line its own prefix.
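As a hedged sketch of the two uses described so far (it assumes GNU diff; the scratch directory and .sorted file names are illustrative), the first command suppresses new and unchanged lines, and the second adds unified-style prefixes:

```shell
# Sample data from the question, sorted into temporary copies.
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmp/file1"
printf 'line1\nline4\nline5\n' > "$tmp/file2"
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# Empty formats suppress new and unchanged lines; old (file1-only) lines
# keep their default format. diff exits with status 1 when the inputs
# differ, hence "|| true".
only_in_file1=$(diff --new-line-format="" --unchanged-line-format="" \
    "$tmp/file1.sorted" "$tmp/file2.sorted" || true)
printf '%s\n' "$only_in_file1"

# Partial imitation of unified format: prefix old, unchanged and new lines.
unified_like=$(diff --old-line-format='-%L' --unchanged-line-format=' %L' \
    --new-line-format='+%L' "$tmp/file1.sorted" "$tmp/file2.sorted" || true)
printf '%s\n' "$unified_like"

rm -rf "$tmp"
```

In bash or zsh, <(sort file1) and <(sort file2) could stand in for the two sorted temporary files.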
The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs differences; it lacks the ---, +++ and @@ lines at the top of each grouped change). You can also use this to do other useful things like number each line with %dn.

The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place. A simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) can instead accept arbitrarily ordered input files and output the missing lines in the order they occur in file1.

Such a script stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, it iterates over ll1 and uses the in operator to determine if the line in file1 is present in file2. (This will have different output to the diff method if there are duplicates.)

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
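Both awk variants could be sketched roughly as follows. The array names ll1, ss2 and ss1 come from the description in this answer; the script file names and the sample data are assumptions made for illustration.

```shell
# Deliberately unsorted sample data in a scratch directory.
tmp=$(mktemp -d)
printf 'line3\nline1\nline2\n' > "$tmp/file1"
printf 'line5\nline1\nline4\n' > "$tmp/file2"

# Variant 1: file1 in ll1[] by line number, file2 in ss2[] by content.
cat > "$tmp/linesnotin.awk" <<'EOF'
NR == FNR { ll1[FNR] = $0; nl1 = FNR; next }   # first file: index by line number
          { ss2[$0] = 1 }                      # second file: index by content
END {
    for (ll = 1; ll <= nl1; ll++)
        if (!(ll1[ll] in ss2)) print ll1[ll]   # keep file1 order
}
EOF

# Variant 2: store only file1 (ll1[] and ss1[]), delete matches while
# reading file2. Caveat: ss1[] remembers one line number per distinct
# line, so duplicate lines in file1 are not all removed.
cat > "$tmp/linesnotin_lowmem.awk" <<'EOF'
NR == FNR { ll1[FNR] = $0; ss1[$0] = FNR; nl1 = FNR; next }
($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0] }
END {
    for (ll = 1; ll <= nl1; ll++)
        if (ll in ll1) print ll1[ll]
}
EOF

result1=$(awk -f "$tmp/linesnotin.awk" "$tmp/file1" "$tmp/file2")
result2=$(awk -f "$tmp/linesnotin_lowmem.awk" "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result1"
printf '%s\n' "$result2"

rm -rf "$tmp"
```

Neither variant needs sorted input, and both print the surviving lines in the order they appear in file1.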
The memory-saving variant stores the entire contents of file1 in two arrays, one indexed by line number, ll1[], and one indexed by line content, ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.

In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), with repeated runs on chunks of file1, reading file2 completely each time. Note the use and placement of - (meaning stdin) on the gawk command line: the chunk is provided by split from file1, 20000 lines per invocation.

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provide GNU diff and awk, though only a POSIX/BSD split rather than a GNU version.
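The split-based divide and conquer mentioned above might be sketched like this. It assumes GNU split (for --filter), uses plain awk in place of gawk, and shrinks the chunk size to 2 lines so the effect is visible on the sample data; the script name linesnotin.awk is illustrative.

```shell
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmp/file1"
printf 'line1\nline4\nline5\n' > "$tmp/file2"

# Store the chunk (first input), delete matches while reading file2.
cat > "$tmp/linesnotin.awk" <<'EOF'
NR == FNR { ll1[FNR] = $0; ss1[$0] = FNR; nl1 = FNR; next }
($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0] }
END { for (ll = 1; ll <= nl1; ll++) if (ll in ll1) print ll1[ll] }
EOF

# split feeds each chunk of file1 to the filter on stdin; "-" tells awk
# to read that chunk first, then file2 is read in full for every chunk.
# A real run would use something like -l 20000 instead of -l 2.
result=$(split -l 2 \
    --filter="awk -f '$tmp/linesnotin.awk' - '$tmp/file2'" < "$tmp/file1")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Only one chunk of file1 is held in memory at a time, at the cost of re-reading file2 once per chunk.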
Like konsolebox suggested, the poster's grep solution actually works great (faster) if you simply add the -F option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000 line file lists I had to compare. With -F it took 0.031 s (real), while without it took 2.278 s (real), when redirecting grep output to wc -l.

These tests also included the -x switch, which is a necessary part of the solution in order to ensure total accuracy in cases where file2 contains lines which match part of, but not all of, one or more lines in file1.

So a solution that does not require the inputs to be sorted, and is fast and flexible (case sensitivity, etc.), is the poster's grep command with -F and -x added.
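A small sketch of that command follows. The extra line10 entry is an assumption added to the question's sample data to show why -x matters.

```shell
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\nline10\n' > "$tmp/file1"
printf 'line1\nline4\nline5\n' > "$tmp/file2"

# -F: patterns are fixed strings, -x: match whole lines only,
# -v: invert the match, -f: read the patterns from file2.
with_x=$(grep -F -x -v -f "$tmp/file2" "$tmp/file1" || true)
printf '%s\n' "$with_x"

# Without -x, "line10" is wrongly dropped because it contains "line1".
without_x=$(grep -F -v -f "$tmp/file2" "$tmp/file1" || true)
printf '%s\n' "$without_x"

rm -rf "$tmp"
```

The "|| true" only guards against grep's exit status 1 when no lines are selected.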
This doesn't work with all versions of grep. For example, it fails on macOS, where a line in file1 will be shown as not present in file2, even though it is, if it matches another line that is a substring of it. Alternatively, you can install GNU grep on macOS in order to use this solution.
If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq.

This is also relatively fast, compared to grep.
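One way to do this with only those three tools is sketched below. The exact pipeline is an assumption, since the command is not shown here: file2 is listed twice so that every line it contains occurs at least twice, and uniq -u then keeps only lines that occur exactly once.

```shell
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmp/file1"
printf 'line1\nline4\nline5\n' > "$tmp/file2"

# Lines from file2 appear at least twice in the combined stream, so
# uniq -u (print only non-repeated lines) leaves the file1-only lines.
# Caveat: a line repeated within file1 itself is dropped as well.
result=$(cat "$tmp/file1" "$tmp/file2" "$tmp/file2" | sort | uniq -u)
printf '%s\n' "$result"

rm -rf "$tmp"
```

The output comes back sorted rather than in file1's original order.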
Use combine from the moreutils package, a sets utility that supports not, and, or, xor operations:
combine file1 not file2
i.e. give me lines that are in file1 but not in file2
OR give me lines in file1 minus lines in file2

Note: combine sorts and finds unique lines in both files before performing any operation, but diff does not. So you might find differences between the output of diff and combine.

So in effect you are saying: find distinct lines in file1 and file2, and then give me lines in file1 minus lines in file2.

In my experience, it's much faster than other options.
What's the speed of sort and diff?
This seems quick for me:
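The command for this answer is not shown here. One tool whose -t option behaves as the answer goes on to describe is GNU join, so the sketch below assumes join was meant; that is a guess, and the sample lines containing spaces are illustrative.

```shell
tmp=$(mktemp -d)
printf 'line1\nline 2\nline3\n' > "$tmp/file1"
printf 'line1\nline 4\nline5\n' > "$tmp/file2"

# join needs sorted input; keep sort and join on the same collation.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -v 1: print only the unpairable lines from the first file.
# -t '': with GNU join, an empty separator makes the whole line the key,
# so lines containing spaces are compared in full.
result=$(LC_ALL=C join -v 1 -t '' "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$result"

rm -rf "$tmp"
```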
The -t makes sure that it compares the whole line, if you had a space in some of the lines.
You can use Python:
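The Python code is not shown here, so the following is only a sketch of one plausible approach, wrapped in a shell call and assuming python3 is installed: load file2 into a set, then stream file1 and print the lines that are not in the set.

```shell
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmp/file1"
printf 'line1\nline4\nline5\n' > "$tmp/file2"

# The script is read from stdin ("-"); the two paths arrive in sys.argv.
result=$(python3 - "$tmp/file1" "$tmp/file2" <<'EOF'
import sys

file1, file2 = sys.argv[1], sys.argv[2]

# Load file2 into a set so each membership test is O(1) on average.
with open(file2) as f:
    exclude = {line.rstrip("\n") for line in f}

# Stream file1 and print lines missing from file2, keeping file1's order.
with open(file1) as f:
    for line in f:
        line = line.rstrip("\n")
        if line not in exclude:
            print(line)
EOF
)
printf '%s\n' "$result"

rm -rf "$tmp"
```

Only file2 is held in memory, and no sorting is needed.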
Using fgrep or adding the -F option to grep could help. But for faster calculations you could use Awk.
You could try one of these Awk methods:
http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219
The way I usually do this is using the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format:
diff -y --suppress-common-lines file1.txt file2.txt
awk-based solution sans any sorting:
To test handling of regexp metachars, substrings, spaces, case-insensitivity, otherwise-duplicate strings that only vary in case, and more than 2 input files, we need some different sample input. We can then do this using any grep (for example grep -iFxvf <(cat input2 input3) input1, if you prefer process substitution) or any awk.

This was originally posted as an answer to "How to print the lines that do not occur in another file?" but was moved here as that question seems likely to be closed as a dup of this one, and the above addresses cases that, best I can tell, are otherwise not covered by the existing answers here.
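As an illustration of those cases, here is a hedged sketch. The sample inputs are made up for this example (the original ones are not shown), the combined patterns file is an assumption, and the awk one-liner is only one possible way to write it.

```shell
tmp=$(mktemp -d)
# input1 is filtered; input2 and input3 hold the lines to exclude.
printf 'Foo\na.c\nabc\ntwo words\nbar\n' > "$tmp/input1"
printf 'foo\na.c\n' > "$tmp/input2"
printf 'two\nBAR\n' > "$tmp/input3"

# grep: -i ignore case, -F fixed strings, -x whole lines, -v invert,
# -f read the patterns from a file built from both exclusion lists.
cat "$tmp/input2" "$tmp/input3" > "$tmp/patterns"
grep_result=$(grep -iFxv -f "$tmp/patterns" "$tmp/input1" || true)
printf '%s\n' "$grep_result"

# awk: remember lower-cased lines from every file except the last
# argument, then print the lines of the last file that were never seen.
awk_result=$(awk '
    FILENAME != ARGV[ARGC-1] { seen[tolower($0)]; next }
    !(tolower($0) in seen)
' "$tmp/input2" "$tmp/input3" "$tmp/input1")
printf '%s\n' "$awk_result"

rm -rf "$tmp"
```

Here "abc" survives because "a.c" is treated as a literal string rather than a regexp, and "two words" survives because "two" only matches part of the line.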