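How to replace the data in one field with data from a different file, based on a common identifier (in another field) that occurs multiple times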

Posted 2025-01-21 03:52:15

I am still quite a beginner. All the threads I found that sounded like my problem were either SQL related or answered with join or merge, which does not work because the identifier is present multiple times in file2.
I have two tab-delimited files. The first contains two columns: one with a unique numerical identifier, the other with a classification. It has thousands of lines.
The second contains my data, with one field containing an identifier. It has hundreds of lines, so many identifiers are not present while others are present multiple times.
I need to connect my data stored in file2 with the classification stored in file1, based on the identifier in both files.

file 1:

12345   kitchen; furniture; table
12346   kitchen; furniture; chair
12347   living room; furniture; sofa
12348   living room; furniture; table
12349   bed room; furniture; bed

file 2:

stuff1  mo_restuff  somenumbers anotherfield    12348
stuff2  morestuff   othernumbers    anotherfield    12346
stuff3  more_stuff  somenumbers anotherfield    12347
stuff4  morestuff   somenumbers yetanotherfield 12347
stuff5  morest.uff  alsonumbers anotherfield    12345

The result should look like:

stuff1  mo_restuff  somenumbers anotherfield    living room; furniture; table
stuff2  morestuff   othernumbers    anotherfield    kitchen; furniture; chair
stuff3  more_stuff  somenumbers anotherfield    living room; furniture; sofa
stuff4  morestuff   somenumbers yetanotherfield living room; furniture; sofa
stuff5  morest.uff  alsonumbers anotherfield    kitchen; furniture; table

I tried (amongst many other things)

awk -F "\t" 'BEGIN { OFS=FS } NR==FNR { a[$1]=$0 ; next } ($5) in a  { print a,$0 } ' file1 file2 > out

but this just printed file2.

I am working on Unix. A solution in bash would be best, but Python would also work.

Thanks also for all the answers to other questions that helped me before!

婴鹅 2025-01-28 03:52:15

Your script is close, but the action { print a,$0 } is strange.
Would you please try the following:

awk '
    BEGIN {FS = OFS = "\t"}
    NR==FNR {a[$1] = $2; next}
    {$5 = a[$5]}
1' file1 file2 > out

Output:

stuff1  mo_restuff      somenumbers     anotherfield    living room; furniture; table
stuff2  morestuff       othernumbers    anotherfield    kitchen; furniture; chair
stuff3  more_stuff      somenumbers     anotherfield    living room; furniture; sofa
stuff4  morestuff       somenumbers     yetanotherfield living room; furniture; sofa
stuff5  morest.uff      alsonumbers     anotherfield    kitchen; furniture; table

In the first block, under the condition NR==FNR, storing $2 is enough; there is no need to store $0.
In the next block, which runs for file2, the 5th field is replaced with the value of a indexed by $5.
The final 1 tells awk to print $0, whose 5th field has been modified as above.
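
One case the script above does not guard against: if an identifier in file2 has no entry in file1, a[$5] is empty and the 5th field is blanked. The following is a minimal sketch of a guarded variant, assuming you would rather keep the original identifier in that situation. The function name replace_ids is made up for illustration and is not part of the answer.

```shell
# replace_ids LOOKUP DATA
# Hypothetical helper (the name is an assumption for this sketch).
# Prints DATA with field 5 replaced by the classification from LOOKUP.
# Lines whose identifier is missing from LOOKUP are printed unchanged.
replace_ids() {
    awk '
        BEGIN {FS = OFS = "\t"}
        NR==FNR {a[$1] = $2; next}   # first file: map identifier to classification
        ($5 in a) {$5 = a[$5]}       # second file: replace only known identifiers
        1                            # print every line, modified or not
    ' "$1" "$2"
}
```

It would be called as replace_ids file1 file2 > out. The ($5 in a) test checks for the key without creating an empty array entry, which a plain a[$5] lookup would do.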