在一步(命令)中按列连接两个具有相同标识符的制表符分隔文件?

发布于 2024-09-25 20:08:37 字数 824 浏览 3 评论 0原文

我经常想连接两个 ascii 文件,它们都是表,因为它们由制表符分隔的列组成,如下所示:

file 1

FRUIT   ID
apple   alpha
banana  beta
cherry  gamma

file 2

ID  FOOBAR
alpha   cat
beta    dog
delta   airplane

我想用内部连接将它们连接起来:

FRUIT   ID  FOOBAR
apple   alpha   cat
banana  beta    dog

或用左连接:(

FRUIT   ID  FOOBAR
apple   alpha   cat
banana  beta    dog
cherry  gamma   n/a

用于连接的标识符不一定是唯一的。)

到目前为止我正在做的是:

  1. 复制不带标题的输入文件。
  2. 按列对输入文件进行排序。
  3. 对已排序的版本使用 linux join 命令。
  4. 删除中间文件。

这很容易出错,因为我需要对列进行计数以稍后指定它们以按数字“排序”和“连接”(对于很多列和非常宽的列更容易出错),我一定不要忘记指定选项卡是分隔符并且需要每次删除/插入/修复标题等。

任何人都可以推荐一种更简单的方法吗?最好是不需要排序并且可以按名称而不是数字指定列的地方?像“joincommand ID file1 file2 > result”之类的东西?

Very often I want to join two ascii-files, which are both tables in the sense that they consist of columns separated by tab, like this:

file 1

FRUIT   ID
apple   alpha
banana  beta
cherry  gamma

file 2

ID  FOOBAR
alpha   cat
beta    dog
delta   airplane

and I want to join them like this with an inner join:

FRUIT   ID  FOOBAR
apple   alpha   cat
banana  beta    dog

or with a left join:

FRUIT   ID  FOOBAR
apple   alpha   cat
banana  beta    dog
cherry  gamma   n/a

(The identifiers used for joining are not necessarily unique.)

What I am doing so far is:

  1. Make copies of the input files without header.
  2. Sort the input files by column.
  3. Use the linux join command on the sorted versions.
  4. Delete intermediate files.

This is error prone as I need to count the columns to specify them later to "sort" and "join" by number (even more error prone with lots of columns and very broad columns), I must not forget to specify that tab is the delimiter and need to remove/insert/fix the header each time etc.

Can anyone recommend a much simpler way? Prefereably where I don't need to sort and where I can specify the column by name, not number? Something like "joincommand ID file1 file2 > result"?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

尘曦 2024-10-02 20:08:37

一种完全不同的方法是使用轻量级 SQL 工具,例如 sqlite。

您可以创建两个表:

$ sqlite3
SQLite version 3.7.2 
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table fruit (fruit varchar(20), id varchar(20));
sqlite> create table foobar (id varchar(20), foobar varchar(20));

将 TAB 设置为分隔符并加载文件:

sqlite> .separator "\t"
sqlite> .import file1 fruit
sqlite> .import file2 foobar

删除标头:

sqlite> delete from fruit where id = 'ID';
sqlite> delete from foobar where id = 'ID';

然后执行您需要的所有查询:

sqlite> select fruit.id, fruit, foobar from fruit, foobar where fruit.id = foobar.id;
alpha   apple   cat
beta    banana  dog
sqlite> .quit
$ 

还可以在 bash 此处文档的帮助下自动执行该任务:

#!/bin/bash

sqlite3 <<-EOF
        create table fruit (fruit varchar(20), id varchar(20));
        create table foobar (id varchar(20), foobar varchar(20));
        .separator "\t"
        .import file1 fruit
        .import file2 foobar
        delete from fruit where id = 'ID';
        delete from foobar where id = 'ID';
        select fruit.id, fruit, foobar from fruit, foobar where fruit.id = foobar.id;
        .quit
EOF

A completely different approach would be to use a lightweight SQL tool, like sqlite.

You can create two tables:

$ sqlite3
SQLite version 3.7.2 
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table fruit (fruit varchar(20), id varchar(20));
sqlite> create table foobar (id varchar(20), foobar varchar(20));

set TAB as a separator and load your files:

sqlite> .separator "\t"
sqlite> .import file1 fruit
sqlite> .import file2 foobar

delete headers:

sqlite> delete from fruit where id = 'ID';
sqlite> delete from foobar where id = 'ID';

then perform all queries you need:

sqlite> select fruit.id, fruit, foobar from fruit, foobar where fruit.id = foobar.id;
alpha   apple   cat
beta    banana  dog
sqlite> .quit
$ 

It is also possible to automate the task with the help of bash here docs:

#!/bin/bash

sqlite3 <<-EOF
        create table fruit (fruit varchar(20), id varchar(20));
        create table foobar (id varchar(20), foobar varchar(20));
        .separator "\t"
        .import file1 fruit
        .import file2 foobar
        delete from fruit where id = 'ID';
        delete from foobar where id = 'ID';
        select fruit.id, fruit, foobar from fruit, foobar where fruit.id = foobar.id;
        .quit
EOF
猫性小仙女 2024-10-02 20:08:37

您可以使用 bash 脚本自动执行您的任务,并且不使用临时文件,就像在这个示例中一样:

#!/bin/bash

id="$1"
file1="$2"
file2="$3"

# get a filename as a parameter
# read first line of file to get $id position
get_pos() {
  awk -v id="$id" '{
      for (i = 1; i <= NF; i++)
        if ($i == id) {
          print i
          exit
        }
    }' "$1"
}

# get $id positions from headers of the two files
pos1=$(get_pos "$file1")
pos2=$(get_pos "$file2")

# print header
printf "%s\t" "$id"
head -n1 "$file1" | sed -r "s/$id(\t|$)//" | tr -d '\n'
head -n1 "$file2" | sed -r "s/$id(\t|$)//"

# print data, add -a1 option for left join
join -t

它不执行任何错误检查,并且也许可以使用其他工具(例如 python)以更优雅的方式实现,但是我希望它有帮助。

\t' -1 $pos1 -2 $pos2 \ <(tail -n+2 "$file1" | sort) \ <(tail -n+2 "$file2" | sort)

它不执行任何错误检查,并且也许可以使用其他工具(例如 python)以更优雅的方式实现,但是我希望它有帮助。

You can automate your task with a bash script, and without the use of temporary files, like in this example:

#!/bin/bash

id="$1"
file1="$2"
file2="$3"

# get a filename as a parameter
# read first line of file to get $id position
get_pos() {
  awk -v id="$id" '{
      for (i = 1; i <= NF; i++)
        if ($i == id) {
          print i
          exit
        }
    }' "$1"
}

# get $id positions from headers of the two files
pos1=$(get_pos "$file1")
pos2=$(get_pos "$file2")

# print header
printf "%s\t" "$id"
head -n1 "$file1" | sed -r "s/$id(\t|$)//" | tr -d '\n'
head -n1 "$file2" | sed -r "s/$id(\t|$)//"

# print data, add -a1 option for left join
join -t

It do not do any error check, and maybe can be realized with other tools, like python, in a more elegant way, but I hope it help.

\t' -1 $pos1 -2 $pos2 \ <(tail -n+2 "$file1" | sort) \ <(tail -n+2 "$file2" | sort)

It do not do any error check, and maybe can be realized with other tools, like python, in a more elegant way, but I hope it help.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文