AWK - Compare two single-column files for matching strings, then write a file with the updated information
I have a problem comparing two files with different numbers of lines and using information from one file to update the other. I have tried various examples that I found online, but none of them seems to work.
Hopefully you could help me with this one.
I have two files:
$ cat 1.txt
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13920
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
(>1000 lines)
$ cat 2.txt
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
...
...
(<1000 lines)
My intention is to produce an updated version of 1.txt in which the matching header lines are replaced by the corresponding lines from 2.txt, while the lines that have no match in 2.txt are kept unchanged. The result could be saved as a new file, as follows:
$ cat 3.txt
>abcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>abcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
I have tried something like this:
awk 'NR==FNR{a[$0]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 1.txt 2.txt > 3.txt
or like this (to search based on the substring in 1.txt):
awk 'NR==FNR{a[substr($0,2,5)]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 2.txt 1.txt > 3.txt
but then I get mixed-up lines in the output file, something like this (or, with the second command, no lines from 2.txt at all):
01234
17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
abcd/efgh/t-13290/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
I haven't used awk for a very long time, and I'm not sure how the arrays and keys work.
Update:
I have tried to write an awk script to do the above. The condition that checks for a match works, but I still have a problem writing out the lines from 1.txt that don't match any line in 2.txt.
BEGIN{
    i = 0;
    j = 0;
    k = 0;
    maxi = 0;
    maxj = 0;
    maxk = 0;
    FS = "\/";
}
FILENAME == ARGV[1]{
    header1=substr($0,1,1);
    if(header1==">"){
        ++maxi;
        seqcode1[maxi]=substr($0,2,5);
        # printf("%s\n",seqcode1[maxi]);
    }
    else if(header1!=">"){
        ++maxk;
        seqFASTA[maxk]=$0;
        # print seqFASTA[maxk];
    }
}
FILENAME == ARGV[2]{
    header2=substr($0,1,1);
    if(header2=="h"){
        ++maxj;
        wholename[maxj]=$0;
        seqcode2[maxj]=substr($3,4,5);
        # printf("%s\n",seqcode2[maxj]);
    }
}
END{
    for(i=1;i<=maxi=maxk;i++){
        for(j=1;j<=maxj;j++){
            if(seqcode1[i] == seqcode2[j]) {
                printf("%s %s %s\n",seqcode1[i],seqcode2[j],wholename[j]);
            }
            else
                print seqcode1[i];
            print seqFASTA[k];
        }
    }
}
I think the problem may be with declaring seqFASTA but I'm not sure where.
Thank you very much!
M.
Comments (1)
I'm assuming 13920 should be 13290 in 1.txt. Here are some alternate solutions:
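As a minimal sketch of one possible approach (not necessarily one of the solutions originally posted in this answer): the one-liners in the question lose the line order because `for (i in a)` visits array elements in an unspecified order. One way around that is to load only 2.txt into an array keyed by the numeric code, and then print 1.txt line by line as it is read, swapping in the full name whenever a header's code is found in the array. The array name `name` and the variable `code` are illustrative choices. The sketch also assumes that the code is whatever follows the last `-` in the third `/`-separated field of 2.txt, and that headers in 1.txt consist of `>` followed only by the code. Note that it writes the lines from 2.txt exactly as they appear there, so with the sample data the headers begin with `>hbcd`, not `>abcd` as in the question's expected output.

```shell
# Recreate the sample inputs from the question
# (13920 corrected to 13290 in 1.txt, per the assumption above).
cat > 1.txt <<'EOF'
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13290
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
EOF

cat > 2.txt <<'EOF'
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
EOF

awk -F/ '
    # First file on the command line (2.txt):
    # map each numeric code to the full line it came from.
    NR == FNR {
        code = $3
        sub(/^.*-/, "", code)    # "z-01234" becomes "01234"
        name[code] = $0
        next
    }
    # Second file (1.txt): replace a header line if its code is known.
    /^>/ {
        code = substr($0, 2)     # everything after the leading ">"
        if (code in name)
            $0 = ">" name[code]
    }
    # Print every line of 1.txt, changed or not, in its original order.
    { print }
' 2.txt 1.txt > 3.txt

cat 3.txt
```

Because nothing is deferred to an `END` block, the output keeps the order of 1.txt, and unmatched headers such as `>17321` pass through untouched. Entries in 2.txt with no counterpart in 1.txt (here `zw-11000`) are simply never used.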