Compare 2 similar files and output only the differences, preserving the order in which they appear?
Hoping someone can help me get my head around this.
I have 2 files: one is 325 lines long, the other is 361 lines long.
The bulk of these files is identical content but the 2nd one has random extra lines inserted. I am only interested in the extra lines, and I need to preserve the order in which they occur in the file.
The files contain a repeating paragraph of approximately 31 lines - I know the first and last line of this paragraph, and have no problems with dropping the entire paragraph, but can't work out how.
i.e. File1
The quick brown
fox jumped
over the
lazy dog
The quick brown
fox jumped
over the
lazy dog
The quick brown
fox jumped
over the
lazy dog
i.e. File2
The quick brown
fox jumped
over the
lazy dog
sadhasdgh
qyyutrytkdaslksad
utyiuiytiuyo
The quick brown
fox jumped
over the
lazy dog
djakdjhgmv
asdjkljkgfyiyi
The quick brown
fox jumped
over the
lazy dog
jghytpuptou
I need to output only the extra lines in this order:
sadhasdgh
qyyutrytkdaslksad
utyiuiytiuyo
djakdjhgmv
asdjkljkgfyiyi
jghytpuptou
Any help or advice would be gratefully received, I am not a *nix person unfortunately :(
I tried a few diff expressions and comm expressions, but can't get what I need.
Try this magic command:
diff file1.txt file2.txt | sed -n 's/^> \(.*\)/\1/p'
diff file1.txt file2.txt should output the lines found only in file2.txt, each prefixed with "> ".
sed -n 's/^> \(.*\)/\1/p' should find the lines starting with ">" and output them without the ">" prefix.
A possible reason for this not working is that diff produces different output on your system.
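To see the diff and sed stages working together on the sample data from the question, here is a minimal sketch. The scratch directory and the file names file1.txt and file2.txt are only illustrative, and it assumes diff's default ("normal") output format, where lines present only in the second file start with "> ".

```shell
# Illustrative scratch directory; any writable location would do.
dir=$(mktemp -d)

# The repeating paragraph from the question.
block='The quick brown
fox jumped
over the
lazy dog'

# file1.txt: the paragraph three times. file2.txt: same, with extra lines inserted.
printf '%s\n' "$block" "$block" "$block" > "$dir/file1.txt"
printf '%s\n' "$block" sadhasdgh qyyutrytkdaslksad utyiuiytiuyo \
              "$block" djakdjhgmv asdjkljkgfyiyi \
              "$block" jghytpuptou > "$dir/file2.txt"

# diff marks lines found only in file2.txt with "> ";
# sed -n ... p prints just those lines with the marker stripped.
extra=$(diff "$dir/file1.txt" "$dir/file2.txt" | sed -n 's/^> \(.*\)/\1/p')
printf '%s\n' "$extra"

rm -rf "$dir"
```

Because diff walks both files from top to bottom, the extra lines come out in the order they occur in file2.txt.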
This should work -
Explanation:
NR and FNR are awk's built-in variables. NR holds the total number of records read so far and is not reset when awk moves on to the second file. FNR is similar to NR, but it is reset at the start of each input file, so it counts records within the current file only.
In this awk one-liner, the condition NR==FNR restricts the action {a[$0]++;next} to file1, because NR==FNR is only true while we are still reading file1. This action stores each line in an array. next is added so that the second action is not run for those lines. Once NR==FNR becomes false, the first action is never run again, and awk moves on to the second action, which checks the content of file2 against the array (i.e. file1). If a line of file2 is in the array, we ignore it. If it is not in the array, we print it, since those are the extra lines that appear only in file2.
Test:
File1:
File2:
Execution:
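The one-liner itself and the test listings appear to be missing from this answer. What follows is a sketch consistent with the explanation above, not necessarily the exact command originally posted: it assumes the second action simply prints every line of file2 that is not a key in the array a, and it uses the sample data from the question. The scratch directory is illustrative.

```shell
# Illustrative scratch directory holding the sample files from the question.
dir=$(mktemp -d)

block='The quick brown
fox jumped
over the
lazy dog'

printf '%s\n' "$block" "$block" "$block" > "$dir/file1"
printf '%s\n' "$block" sadhasdgh qyyutrytkdaslksad utyiuiytiuyo \
              "$block" djakdjhgmv asdjkljkgfyiyi \
              "$block" jghytpuptou > "$dir/file2"

# First action (NR==FNR, i.e. while reading file1): remember each line as an array key.
# Second action (pattern only, assumed): print lines of file2 that are not keys in a.
extra=$(awk 'NR==FNR { a[$0]++; next } !($0 in a)' "$dir/file1" "$dir/file2")
printf '%s\n' "$extra"

rm -rf "$dir"
```

One thing to keep in mind with this approach is that it compares line contents, not positions: an extra line in file2 whose text also occurs anywhere in file1 would not be printed.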
This might work for you (GNU diff):
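The command that followed this line is missing here. One possibility, offered only as a guess at what was meant, is to use GNU diff's line-format options so that diff itself prints nothing but the lines unique to the second file. The sketch below assumes GNU diff (these options are not available in every diff implementation) and uses the sample data from the question in an illustrative scratch directory.

```shell
# Illustrative scratch directory holding the sample files from the question.
dir=$(mktemp -d)

block='The quick brown
fox jumped
over the
lazy dog'

printf '%s\n' "$block" "$block" "$block" > "$dir/file1"
printf '%s\n' "$block" sadhasdgh qyyutrytkdaslksad utyiuiytiuyo \
              "$block" djakdjhgmv asdjkljkgfyiyi \
              "$block" jghytpuptou > "$dir/file2"

# GNU diff only: print nothing for old and unchanged lines,
# and print new lines (those only in file2) verbatim via %L.
# diff exits with status 1 when the files differ, hence the "|| true".
extra=$(diff --old-line-format='' --unchanged-line-format='' \
             --new-line-format='%L' "$dir/file1" "$dir/file2") || true
printf '%s\n' "$extra"

rm -rf "$dir"
```

This avoids the separate sed step, at the cost of depending on GNU-specific options.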