AWK - compare two single-column files for matching strings, then write a file with the updated information

Posted on 2025-01-16 19:27:43

I have a problem comparing two files with different numbers of lines and using information from one file to update the other. I have tried various key-based examples that I found online, but none seems to work.

Hopefully you could help me with this one.

I have two files:

$ cat 1.txt
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13920
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
(>1000 lines)


$ cat 2.txt
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
...
...
(<1000 lines)

My intention is to produce an updated version of 1.txt in which each header line that matches an entry in 2.txt is replaced by that entry, while the lines that have no match in 2.txt are kept as they are. The result could be saved as a new file, as follows:

$ cat 3.txt
>abcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>abcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...

I have tried something like this:
awk 'NR==FNR{a[$0]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 1.txt 2.txt > 3.txt

or like this (to search based on the substring in 1.txt):
awk 'NR==FNR{a[substr($0,2,5)]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 2.txt 1.txt > 3.txt

but then I get mixed-up lines in the output file, something like this (or, with the second command, no lines from 2.txt at all):

01234
17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
abcd/efgh/t-13290/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN

I haven't used awk for a very long time and I'm not sure how the arrays and keys work.

Update:
I have tried to write an awk script to do the above. The condition that checks for a match works, but somehow I still have a problem writing out the lines from 1.txt that don't match any line in 2.txt.

BEGIN{
    i = 0;
    j = 0;
    k = 0;
    maxi = 0;
    maxj = 0;
    maxk = 0;
    FS = "\/";
}

FILENAME == ARGV[1]{
    header1=substr($0,1,1);
    if(header1==">"){
        ++maxi;
        seqcode1[maxi]=substr($0,2,5);
#       printf("%s\n",seqcode1[maxi]);
    }
    else if(header1!=">"){
        ++maxk;
        seqFASTA[maxk]=$0;
#       print seqFASTA[maxk];
    }
}

FILENAME == ARGV[2]{
    header2=substr($0,1,1);
    if(header2=="h"){
        ++maxj;
        wholename[maxj]=$0;
        seqcode2[maxj]=substr($3,4,5);
#       printf("%s\n",seqcode2[maxj]);
    }
}

END{
    for(i=1;i<=maxi=maxk;i++){
      for(j=1;j<=maxj;j++){
        if(seqcode1[i] == seqcode2[j]) {
            printf("%s %s %s\n",seqcode1[i],seqcode2[j],wholename[j]);
        }
        else
          print seqcode1[i];
          print seqFASTA[k];
        }
    }
}

I think the problem may be in how seqFASTA is declared, but I'm not sure where.

Thank you very much!
M.

爱给你人给你 2025-01-23 19:27:43

I'm assuming that 13920 in 1.txt should be 13290, so that it matches the third line of 2.txt.

$ awk 'NR==FNR{split($0, a, "/"); sub(/^[^-]+-/, "", a[3]); map[a[3]]=$0; next}
       (k=substr($0, 2)) in map{$0 = ">" map[k]} 1' 2.txt 1.txt
>hbcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>hbcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
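
The one-liner above can be easier to follow when unpacked into a commented script. The following is a minimal sketch, not a replacement for it: the array name `part`, the temporary directory, and the extra `/^>/` guard are additions for illustration, and the sample data assumes, as above, that 13920 is read as 13290.

```shell
# Hypothetical sample data copied from the question (13920 read as 13290).
tmp=$(mktemp -d)
printf '%s\n' '>01234' 'NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA' \
              '>17321' 'SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA' \
              '>13290' 'ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN' > "$tmp/1.txt"
printf '%s\n' 'hbcd/efgh/z-01234/2000' 'hbcd/efgh/zw-11000/2000' \
              'hbcd/efgh/t-13290/2000' > "$tmp/2.txt"

result=$(awk '
    # First file (2.txt): NR == FNR holds only while the first file is read.
    NR == FNR {
        split($0, part, "/")          # part[3] is e.g. "z-01234"
        sub(/^[^-]+-/, "", part[3])   # drop everything up to the first "-"
        map[part[3]] = $0             # code -> full name
        next
    }
    # Second file (1.txt): rewrite a header whose code is a key of map.
    /^>/ && (substr($0, 2) in map) { $0 = ">" map[substr($0, 2)] }
    { print }                         # print every line of 1.txt, changed or not
' "$tmp/2.txt" "$tmp/1.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because each line of 1.txt is printed as soon as it is read, the output keeps the order of 1.txt. An END loop over `for (i in a)`, as in the attempts in the question, visits the keys in an order that awk leaves unspecified, which would explain the mixed-up lines.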

Here are some alternative solutions:

# with GNU awk
awk 'NR==FNR{match($0, /-([0-9]+)/, a); map[a[1]]=$0; next}
     (k=substr($0, 2)) in map{$0 = ">" map[k]} 1' 2.txt 1.txt

# assuming '/' and '-' will always be similar to given sample
awk -F'[/-]' 'NR==FNR{map[$4]=">"$0; next}
              $2 in map{$0 = map[$2]} 1' 2.txt FS='>' 1.txt
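
As for the script in the question's update, a few things appear to go wrong in its END block: the loop condition `i<=maxi=maxk` contains an assignment instead of a second comparison, the `else` has no braces so `print seqFASTA[k]` runs on every pass of the inner loop, and `k` is never changed from 0, so `seqFASTA[k]` is always empty. In addition, `substr($3,4,5)` depends on the length of the prefix before the hyphen; on the sample line `hbcd/efgh/z-01234/2000` it yields `1234`, not `01234`. A possible repair, sketched under the assumption that every header in 1.txt is followed by exactly one sequence line, keeps the store-then-print structure but replaces the inner loop with an array lookup. The names `code`, `seq`, `name`, and `key` are hypothetical.

```shell
# Hypothetical sample data copied from the question (13920 read as 13290).
tmp=$(mktemp -d)
printf '%s\n' '>01234' 'NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA' \
              '>17321' 'SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA' \
              '>13290' 'ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN' > "$tmp/1.txt"
printf '%s\n' 'hbcd/efgh/z-01234/2000' 'hbcd/efgh/zw-11000/2000' \
              'hbcd/efgh/t-13290/2000' > "$tmp/2.txt"

fixed=$(awk '
    BEGIN { FS = "/" }
    # 1.txt (first argument): remember headers and sequences in input order.
    FILENAME == ARGV[1] {
        if (substr($0, 1, 1) == ">") code[++n] = substr($0, 2)
        else                          seq[n] = $0   # assumes one sequence line per header
        next
    }
    # 2.txt (second argument): index full names by the code after the hyphen.
    {
        key = $3
        sub(/^[^-]+-/, "", key)   # works for any prefix length before "-"
        name[key] = $0
    }
    END {
        for (i = 1; i <= n; i++) {
            # One lookup per header replaces the inner loop over 2.txt.
            if (code[i] in name) print ">" name[code[i]]
            else                 print ">" code[i]
            print seq[i]
        }
    }
' "$tmp/1.txt" "$tmp/2.txt")

printf '%s\n' "$fixed"
rm -rf "$tmp"
```

Indexing `code` and `seq` by the same counter keeps each sequence paired with its header, and the numeric `for` loop prints them in the order they were read.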
