在一步(命令)中按列连接两个具有相同标识符的制表符分隔文件?
我经常想连接两个 ascii 文件,它们都是表,因为它们由制表符分隔的列组成,如下所示:
file 1
FRUIT ID
apple alpha
banana beta
cherry gamma
file 2
ID FOOBAR
alpha cat
beta dog
delta airplane
我想用内部连接将它们连接起来:
FRUIT ID FOOBAR
apple alpha cat
banana beta dog
或用左连接:(
FRUIT ID FOOBAR
apple alpha cat
banana beta dog
cherry gamma n/a
用于连接的标识符不一定是唯一的。)
到目前为止我正在做的是:
- 复制不带标题的输入文件。
- 按列对输入文件进行排序。
- 对已排序的版本使用 linux join 命令。
- 删除中间文件。
这很容易出错,因为我需要对列进行计数以稍后指定它们以按数字“排序”和“连接”(对于很多列和非常宽的列更容易出错),我一定不要忘记指定选项卡是分隔符并且需要每次删除/插入/修复标题等。
任何人都可以推荐一种更简单的方法吗?最好是不需要排序并且可以按名称而不是数字指定列的地方?像“joincommand ID file1 file2 > result”之类的东西?
Very often I want to join two ascii-files, which are both tables in the sense that they consist of columns separated by tab, like this:
file 1
FRUIT ID
apple alpha
banana beta
cherry gamma
file 2
ID FOOBAR
alpha cat
beta dog
delta airplane
and I want to join them like this with an inner join:
FRUIT ID FOOBAR
apple alpha cat
banana beta dog
or with a left join:
FRUIT ID FOOBAR
apple alpha cat
banana beta dog
cherry gamma n/a
(The identifiers used for joining are not necessarily unique.)
What I am doing so far is:
- Make copies of the input files without header.
- Sort the input files by column.
- Use the linux join command on the sorted versions.
- Delete intermediate files.
This is error prone as I need to count the columns to specify them later to "sort" and "join" by number (even more error prone with lots of columns and very broad columns), I must not forget to specify that tab is the delimiter and need to remove/insert/fix the header each time etc.
Can anyone recommend a much simpler way? Prefereably where I don't need to sort and where I can specify the column by name, not number? Something like "joincommand ID file1 file2 > result"?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
一种完全不同的方法是使用轻量级 SQL 工具,例如 sqlite。
您可以创建两个表:
将 TAB 设置为分隔符并加载文件:
删除标头:
然后执行您需要的所有查询:
还可以在 bash 此处文档的帮助下自动执行该任务:
A completely different approach would be to use a lightweight SQL tool, like sqlite.
You can create two tables:
set TAB as a separator and load your files:
delete headers:
then perform all queries you need:
It is also possible to automate the task with the help of bash here docs:
您可以使用 bash 脚本自动执行您的任务,并且不使用临时文件,就像在这个示例中一样:
它不执行任何错误检查,并且也许可以使用其他工具(例如 python)以更优雅的方式实现,但是我希望它有帮助。
You can automate your task with a bash script, and without the use of temporary files, like in this example:
It do not do any error check, and maybe can be realized with other tools, like python, in a more elegant way, but I hope it help.