AWK - compare two single-column files for matching strings, then write a file with the updated information
I have a problem comparing two files with different numbers of lines and using information from one file to update the other. I have tried various key-based examples that I found online, but none of them seems to work.
Hopefully you could help me with this one.
I have two files:
$ cat 1.txt
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13920
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
(>1000 lines)
$ cat 2.txt
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
...
...
(<1000 lines)
My intention is to get an updated version of 1.txt in which the matching header lines are replaced, while the lines that have no match in 2.txt are kept as they are. The result could be saved as a new file as follows:
$ cat 3.txt
>abcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>abcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
I have tried something like this:
awk 'NR==FNR{a[$0]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 1.txt 2.txt > 3.txt
or like this (to search based on the substring in 1.txt):
awk 'NR==FNR{a[substr($0,2,5)]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 2.txt 1.txt > 3.txt
but then I get mixed-up lines in the output file, something like this (or, in the second case, without the lines from 2.txt):
01234
17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
abcd/efgh/t-13290/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
I haven't used awk for a very long time, and I'm not sure how arrays and keys work.
Update:
I have tried to write an awk script to do the above. The condition that checks for matches works, but somehow I still have a problem writing out the lines from 1.txt that don't match any line in 2.txt.
BEGIN{
    i = 0;
    j = 0;
    k = 0;
    maxi = 0;
    maxj = 0;
    maxk = 0;
    FS = "\/";
}
FILENAME == ARGV[1]{
    header1=substr($0,1,1);
    if(header1==">"){
        ++maxi;
        seqcode1[maxi]=substr($0,2,5);
        # printf("%s\n",seqcode1[maxi]);
    }
    else if(header1!=">"){
        ++maxk;
        seqFASTA[maxk]=$0;
        # print seqFASTA[maxk];
    }
}
FILENAME == ARGV[2]{
    header2=substr($0,1,1);
    if(header2=="h"){
        ++maxj;
        wholename[maxj]=$0;
        seqcode2[maxj]=substr($3,4,5);
        # printf("%s\n",seqcode2[maxj]);
    }
}
END{
    for(i=1;i<=maxi=maxk;i++){
        for(j=1;j<=maxj;j++){
            if(seqcode1[i] == seqcode2[j]) {
                printf("%s %s %s\n",seqcode1[i],seqcode2[j],wholename[j]);
            }
            else
                print seqcode1[i];
            print seqFASTA[k];
        }
    }
}
I think the problem may be with declaring seqFASTA but I'm not sure where.
Thank you very much!
M.
I'm assuming 13920 should be 13290 in 1.txt. Here are some alternate solutions:
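The solutions themselves are not reproduced in this copy. As a minimal sketch of one possible approach (not necessarily what the answerer posted), the script below reads 2.txt first and stores each full line under its numeric code, then streams 1.txt and replaces a header only when its code has an entry. It assumes that the code is whatever follows the last hyphen in the third "/"-separated field of 2.txt, and that a header in 1.txt is exactly ">" followed by the code. The names name and code are illustrative, and the sample data is copied from the question with 13920 changed to 13290 as assumed above.

```shell
# Sample input copied from the question (13920 corrected to 13290, an assumption).
cat > 1.txt <<'EOF'
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13290
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
EOF

cat > 2.txt <<'EOF'
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
EOF

awk -F'/' '
NR == FNR {                  # first file on the command line: 2.txt
    code = $3                # e.g. "z-01234"
    sub(/^.*-/, "", code)    # drop everything up to the last hyphen -> "01234"
    name[code] = $0          # remember the whole line under that code
    next
}
/^>/ {                       # header line in 1.txt
    code = substr($0, 2)     # text after ">"
    if (code in name) {      # a matching entry exists in 2.txt
        print ">" name[code]
        next
    }
}
{ print }                    # unmatched headers and sequence lines pass through
' 2.txt 1.txt > 3.txt
```

With this sample data, the first and fifth lines of 3.txt become >hbcd/efgh/z-01234/2000 and >hbcd/efgh/t-13290/2000, while >17321 and all sequence lines are unchanged. Because every line is printed while 1.txt is being read, the original order is preserved. The attempts in the question print from a for (i in a) loop in END instead, and awk does not guarantee any particular order for that loop, which is why the lines came out shuffled.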