Joining 3 files by their first column (is it awk?)

Posted on 2024-08-31 20:28:35

I have three similar files; they all look like this:

File A

ID1 Value1a
ID2 Value2a
  .
  .
  .
IDN ValueNa

and I want output like this:

Output

ID1 Value1a Value1b Value1c
ID2 Value2a Value2b Value2c
.....
IDN ValueNa ValueNb ValueNc

Looking at the first line, I want Value1a to be the value of ID1 in fileA, Value1b the value of ID1 in fileB, and so on for each field and each line. I think of it as something like an SQL join. I've tried several things, but none of them were even close.

EDIT: All files have the same length and the same IDs.

Comments (4)

摇划花蜜的午后 2024-09-07 20:28:36

Give join(1) a try:

join fileA fileB | join - fileC
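
Since join(1) expects each input to be sorted on the join field, here is a minimal sketch of the same pipeline with a sort step in front. The sample contents, the temporary directory and the `.sorted` suffix are made up for illustration; only the fileA/fileB/fileC names come from the question. It assumes `mktemp -d` is available.

```shell
# Use the same collation for sort and join so they agree on ordering.
export LC_ALL=C

# Hypothetical sample data, deliberately not all in sorted order.
tmp=$(mktemp -d)
printf 'ID2 Value2a\nID1 Value1a\n' > "$tmp/fileA"
printf 'ID1 Value1b\nID2 Value2b\n' > "$tmp/fileB"
printf 'ID2 Value2c\nID1 Value1c\n' > "$tmp/fileC"

# Sort every file on its first field before joining.
for f in fileA fileB fileC; do
    sort -k1,1 "$tmp/$f" > "$tmp/$f.sorted"
done

# Join A with B, then join that result with C.
joined=$(join "$tmp/fileA.sorted" "$tmp/fileB.sorted" | join - "$tmp/fileC.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

With this sample data the result has one line per ID, with the values from fileA, fileB and fileC in that order.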
孤云独去闲 2024-09-07 20:28:36

join (Dennis's answer) is better, but just for kicks, here's what I came up with in awk (it relies on all three files listing the same IDs in the same order):

awk '{a=$0; getline b <"fileB"; getline c <"fileC"; $0=a" "b" "c; print $1,$2,$4,$6}' <fileA
本宫微胖 2024-09-07 20:28:36

Update: The question has been edited to state that all files contain all keys, so the accepted answer (join) is definitely better than this one. Only consider using this one if it's possible the keys may not be in all files.


If you're not too concerned about performance, you could try this quick-and-dirty approach:

$ cat file_a
    ID5 Value5a
    ID1 Value1a
    ID3 Value3a
    ID4 Value4a
    ID2 Value2a
$ cat file_b
    ID1 Value1b
    ID3 Value3b
$ cat file_c
    ID2 Value2c
    ID3 Value3c
    ID4 Value4c
    ID5 Value5c
$ cat qq.sh
    #!/bin/bash
    keylist=$(awk '{print $1}' file_[abc] | sort | uniq)
    for key in ${keylist} ; do
        val_a=$(grep "^${key} " file_a | awk '{print $2}') ; val_a=${val_a:--}
        val_b=$(grep "^${key} " file_b | awk '{print $2}') ; val_b=${val_b:--}
        val_c=$(grep "^${key} " file_c | awk '{print $2}') ; val_c=${val_c:--}
        echo "${key} ${val_a} ${val_b} ${val_c}"
    done
$ ./qq.sh
    ID1 Value1a Value1b -
    ID2 Value2a - Value2c
    ID3 Value3a Value3b Value3c
    ID4 Value4a - Value4c
    ID5 Value5a - Value5c

This works out the full list of keys first, then looks up the value for each key in each file, substituting "-" if the key is not in that file.

The grep commands will need to be adjusted if the files are more complex (if field 1 isn't at the start of the line, or is followed by a separator other than a space), but this should be a reasonable first-cut solution. The likely grep to use in that case would be:

grep "^[ X]*${key}[ X]"

where X is actually the tab character. This allows zero or more spaces or tabs before the key, and a space or tab to terminate the key.

If the files are particularly large, you may want to look into using associative arrays within awk but, since there's no indication of the size, I'd start with this one until it gets to the point where it runs too slowly.
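
For what it's worth, the associative-array idea could look roughly like the sketch below. It is not taken from the answer: it reuses the file_a/file_b/file_c sample data shown earlier, the variable names are hypothetical, and it assumes no input file is empty (an empty file would not bump the file counter and the columns would shift).

```shell
# Recreate the sample files from the example above in a scratch directory.
tmp=$(mktemp -d)
printf 'ID5 Value5a\nID1 Value1a\nID3 Value3a\nID4 Value4a\nID2 Value2a\n' > "$tmp/file_a"
printf 'ID1 Value1b\nID3 Value3b\n' > "$tmp/file_b"
printf 'ID2 Value2c\nID3 Value3c\nID4 Value4c\nID5 Value5c\n' > "$tmp/file_c"

# One pass over all files: remember every key, and the value per (key, file).
merged=$(awk '
    FNR == 1 { nfiles++ }                      # first line of the next file
    { keys[$1] = 1; val[$1, nfiles] = $2 }     # store value under (key, file index)
    END {
        for (k in keys) {
            line = k
            for (i = 1; i <= nfiles; i++)      # "-" marks a key missing from file i
                line = line " " (((k, i) in val) ? val[k, i] : "-")
            print line
        }
    }' "$tmp/file_a" "$tmp/file_b" "$tmp/file_c" | sort)
printf '%s\n' "$merged"

rm -rf "$tmp"
```

Each file is read only once here, instead of once per key as in the grep loop. The trailing sort is there because awk's `for (k in keys)` visits keys in no particular order.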

心头的小情儿 2024-09-07 20:28:36

Just to add that, in order for join to work, the input should be sorted.
This awk solution should handle any number of input files.
Note that you will lose the original order of the keys (you'll need more code to preserve it).

awk 'END {
  for (K in k) print K, k[K]
    }
{ 
  k[$1] = k[$1] ? k[$1] FS $2 : $2 
  }' file1 file2 [.. filen]
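
As for the "more code" needed to keep the original key order, one possible sketch is to record each key the first time it is seen and replay that list in the END block. The sample data and the `order`/`n` names are hypothetical. It also tests `$1 in k` rather than the truthiness of `k[$1]`, so a stored value such as 0 is not mistaken for a missing key.

```shell
# Hypothetical sample data; the first file decides the output order.
tmp=$(mktemp -d)
printf 'ID2 Value2a\nID1 Value1a\n' > "$tmp/file1"
printf 'ID1 Value1b\nID2 Value2b\n' > "$tmp/file2"
printf 'ID2 Value2c\nID1 Value1c\n' > "$tmp/file3"

ordered=$(awk '
    !($1 in k) { order[++n] = $1; k[$1] = $2; next }   # first time this key is seen
    { k[$1] = k[$1] FS $2 }                            # append later values
    END { for (i = 1; i <= n; i++) print order[i], k[order[i]] }
' "$tmp/file1" "$tmp/file2" "$tmp/file3")
printf '%s\n' "$ordered"

rm -rf "$tmp"
```

Keys come out in the order they first appear across the inputs, so with this data ID2 is printed before ID1.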