How to replace the data in one field with data from a different file, based on a common identifier (in another field) that occurs multiple times
I am still quite a beginner, and all the threads I found that sounded like my problem were either SQL-related or had "use join or merge" as the answer, which does not work because the identifier is present multiple times in file2.
I have two tab-delimited files. The first contains two columns: one with a unique numerical identifier, the other with a classification. There are thousands of lines.
The second contains my data, with one field holding an identifier. There are only hundreds of lines, so many identifiers are not present at all, while others appear multiple times.
I need to connect my data stored in file2 with the classification stored in file1, based on the identifier present in both files.
file 1:
12345 kitchen; furniture; table
12346 kitchen; furniture; chair
12347 living room; furniture; sofa
12348 living room; furniture; table
12349 bed room; furniture; bed
file 2:
stuff1 mo_restuff somenumbers anotherfield 12348
stuff2 morestuff othernumbers anotherfield 12346
stuff3 more_stuff somenumbers anotherfield 12347
stuff4 morestuff somenumbers yetanotherfield 12347
stuff5 morest.uff alsonumbers anotherfield 12345
The result should look like:
stuff1 mo_restuff somenumbers anotherfield living room; furniture; table
stuff2 morestuff othernumbers anotherfield kitchen; furniture; chair
stuff3 more_stuff somenumbers anotherfield living room; furniture; sofa
stuff4 morestuff somenumbers yetanotherfield living room; furniture; sofa
stuff5 morest.uff alsonumbers anotherfield kitchen; furniture; table
I tried (amongst many other things)
awk -F "\t" 'BEGIN { OFS=FS } NR==FNR { a[$1]=$0 ; next } ($5) in a { print a,$0 } ' file1 file2 > out
but this just printed file2.
I am working on Unix, best would be a solution in bash but python would also work.
Thanks also for all the answers to other questions that helped me before!
Your script is close, but the action
{ print a,$0 }
is strange: a is the name of an array, so awk cannot print it as a whole. Would you please try the following instead: in the NR==FNR block, storing $2 will be good enough, not $0. Then, in the next block, which runs for file2, just replace the 5th field with the value of a indexed by $5. The trailing 1 tells awk to print $0, of which the 5th field has been modified as above.
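The steps described above can be assembled into a full command. This is a sketch (the answer's original code block did not survive, so the exact one-liner below is reconstructed from the description), shown here with a two-line sample of the question's data:

```shell
# Recreate small tab-separated samples of file1 and file2 from the question.
printf '12345\tkitchen; furniture; table\n12347\tliving room; furniture; sofa\n' > file1
printf 'stuff3\tmore_stuff\tsomenumbers\tanotherfield\t12347\n' > file2

# Pass 1 (NR==FNR, i.e. while reading file1): remember the classification
# ($2) keyed by the identifier ($1).
# Pass 2 (file2): overwrite the 5th field with the stored classification;
# the bare 1 is an always-true condition whose default action prints $0.
awk -F '\t' 'BEGIN { OFS = FS } NR==FNR { a[$1] = $2; next } { $5 = a[$5] } 1' file1 file2 > out
cat out
```

Because every line of file2 is looked up independently in the array, an identifier may occur in file2 any number of times (or not at all); only the identifiers in file1 need to be unique.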