Merge two files with awk on a common field and print matches and differences

Published 2024-10-08 17:46:26


I have two files I would like to merge into a third, but I need to see both where they share a common field and where they differ. Since there are minor differences in the other fields, I cannot use a diff tool, and I thought this could be done with awk.

File 1:

aWonderfulMachine           1   mlqsjflk
AnotherWonderfulMachine     2   mlksjf
YetAnother WonderfulMachine 3   sdg
TrashWeWon'tBuy             4   jhfgjh
MoreTrash                   5   qsfqf
MiscelleneousStuff          6   qfsdf
MoreMiscelleneousStuff      7   qsfwsf

File 2:

aWonderfulMachine           22  dfhdhg
aWonderfulMachine           23  dfhh
aWonderfulMachine           24  qdgfqf
AnotherWonderfulMachine     25  qsfsq
AnotherWonderfulMachine     26  qfwdsf
MoreDifferentStuff          27  qsfsdf
StrangeStuffBought          28  qsfsdf

Desired output:

aWonderfulMachine       1  mlqsjflk   aWonderfulMachine       22  dfhdhg
                                      aWonderfulMachine       23  dfhh
                                      aWonderfulMachine       24  qdgfqf
AnotherWonderfulMachine 2  mlksjf     AnotherWonderfulMachine 25  qsfsq
                                      AnotherWonderfulMachine 26  qfwdsf
File1
YetAnother WonderfulMachine 3  sdg
TrashWeWon'tBuy             4  jhfgjh
MoreTrash                   5  qsfqf
MiscelleneousStuff          6  qfsdf
MoreMiscelleneousStuff      7  qsfwsf
File2
MoreDifferentStuff          27 qsfsdf
StrangeStuffBought          28 qsfsdf

I have tried a few awk scripts here and there, but they are either based on two fields only and I don't know how to modify the output, or they delete the duplicates based on two fields only, etc. (I am new to this, and awk syntax is tough).
Thank you very much in advance for your help.


Comments (2)

吹梦到西洲 2024-10-15 17:46:26


You can come very close using these three commands:

join <(sort file1) <(sort file2)
join -v 1 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)

This assumes a shell, such as Bash, that supports process substitution (<()). If you're using a shell that doesn't, the files would need to be pre-sorted.
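For example, with two tiny single-space-separated files (hypothetical names t1 and t2), the three commands produce the matched, left-only, and right-only sections; pre-sorting to temporary files also sidesteps the process-substitution requirement:

```shell
dir=$(mktemp -d)
printf 'a 1 x\nb 2 y\n'   > "$dir/t1"
printf 'a 10 p\nc 11 q\n' > "$dir/t2"

# join requires sorted input; sort once up front
sort "$dir/t1" > "$dir/t1.sorted"
sort "$dir/t2" > "$dir/t2.sorted"

both=$(join "$dir/t1.sorted" "$dir/t2.sorted")        # keys in both files, merged
only1=$(join -v 1 "$dir/t1.sorted" "$dir/t2.sorted")  # lines only in t1
only2=$(join -v 2 "$dir/t1.sorted" "$dir/t2.sorted")  # lines only in t2

printf '%s\nFile1\n%s\nFile2\n%s\n' "$both" "$only1" "$only2"
```

By default join matches on the first field and separates output fields with a single space, which is why the merged line reads `a 1 x 10 p`.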

To do this in AWK:

#!/usr/bin/awk -f
BEGIN { FS="\t"; flag=1; file1=ARGV[1]; file2=ARGV[2] }
FNR == NR { lines1[$1] = $0; count1[$1]++; next }  # process the first file
{   # process the second file and do output
    lines2[$1] = $0;
    count2[$1]++;
    if ($1 != prev) { flag = 1 };
    if (count1[$1]) {
        if (flag) printf "%s ", lines1[$1];
        else printf "\t\t\t\t\t"
        flag = 0;
        printf "\t%s\n", $0
    }
    prev = $1
}
END { # output lines that are unique to one file or the other
    print "File 1: " file1
    for (i in lines1) if (! (i in lines2)) print lines1[i]
    print "File 2: " file2
    for (i in lines2) if (! (i in lines1)) print lines2[i]
}

To run it:

$ ./script.awk file1 file2

The lines won't be output in the same order that they appear in the input files. The second input file (file2) needs to be sorted since the script assumes that similar lines are adjacent. You will probably want to adjust the tabs or other spacing in the script. I haven't done much in that regard.
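A self-contained way to try the script with tiny, hypothetical tab-separated inputs (the script sets FS="\t", so the fields must be tab-separated):

```shell
dir=$(mktemp -d)

# Save the program above as script.awk
cat > "$dir/script.awk" <<'EOF'
BEGIN { FS="\t"; flag=1; file1=ARGV[1]; file2=ARGV[2] }
FNR == NR { lines1[$1] = $0; count1[$1]++; next }  # process the first file
{   # process the second file and do output
    lines2[$1] = $0
    count2[$1]++
    if ($1 != prev) flag = 1
    if (count1[$1]) {
        if (flag) printf "%s ", lines1[$1]; else printf "\t\t\t\t\t"
        flag = 0
        printf "\t%s\n", $0
    }
    prev = $1
}
END { # output lines that are unique to one file or the other
    print "File 1: " file1
    for (i in lines1) if (! (i in lines2)) print lines1[i]
    print "File 2: " file2
    for (i in lines2) if (! (i in lines1)) print lines2[i]
}
EOF

# Key "a" is in both files, "b" only in t1, "c" only in t2
printf 'a\t1\tx\nb\t2\ty\n'             > "$dir/t1"
printf 'a\t10\tp\na\t11\tq\nc\t12\tr\n' > "$dir/t2"

out=$(cd "$dir" && awk -f script.awk t1 t2)
printf '%s\n' "$out"
```

The two "a" lines of t2 appear first, paired with t1's "a" line, followed by the unique-line sections for each file.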

撩人痒 2024-10-15 17:46:26


One way to do it (albeit with hardcoded file names):

BEGIN {
    FS = "\t"
    readfile(ARGV[1], s1)
    readfile(ARGV[2], s2)
    ARGV[1] = ARGV[2] = "/dev/null"
}
END {
    for (k in s1)
        if (s2[k]) printpair(k, s1, s2)
    print "file1:"
    for (k in s1)
        if (!s2[k]) print s1[k]
    print "file2:"
    for (k in s2)
        if (!s1[k]) print s2[k]
}
# Read fname into sary, keyed on field 1; lines sharing a key are
# concatenated with "\n" so one entry holds the whole group
function readfile(fname, sary) {
    while ((getline < fname) > 0) {
        key = $1
        if (sary[key])
            sary[key] = sary[key] "\n" $0
        else
            sary[key] = $0
    }
    close(fname)
}
# Print the two groups for a key side by side, padding the shorter
# side with a same-width blank string
function printpair(key, s1, s2) {
    n1 = split(s1[key], l1, "\n")
    n2 = split(s2[key], l2, "\n")
    for (i = 1; i <= max(n1, n2); i++) {
        if (i == 1) {
            b = l1[1]
            gsub(/./, " ", b)   # blank "cell" as wide as the first left line
        }
        if (i <= n1) { f1 = l1[i] } else { f1 = b }
        if (i <= n2) { f2 = l2[i] } else { f2 = b }
        printf("%s\t%s\n", f1, f2)
    }
}
function max(x, y) { return (x > y) ? x : y }

Not particularly elegant, but it handles many-to-many cases.
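The one subtle trick above is the gsub in printpair: it replaces every character of the first left-hand line with a space, producing a blank string of the same width to pad the shorter side's continuation rows. In isolation:

```shell
# gsub(/./, " ", b) turns "abc" into three spaces, preserving the width
blank=$(printf 'abc' | awk '{ b = $0; gsub(/./, " ", b); print b }')
printf '[%s]\n' "$blank"
```

This keeps the right-hand column aligned even when one file has more lines for a key than the other.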
