How do I compare two text files with multiple fields in Unix?

Posted on 2024-09-08 11:54:08

I have two text files:

  • file 1

    number,name,account id,vv,sfee,dac acc,TDID
    7000,john,2,0,0,1,6
    7001,elen,2,0,0,1,7
    7002,sami,2,0,0,1,6
    7003,mike,1,0,0,2,1
    8001,nike,1,2,4,1,8
    8002,paul,2,0,0,2,7 
    
  • file 2

    number,account id,dac acc,TDID
    7000,2,1,6
    7001,2,1,7
    7002,2,1,6
    7003,1,2,1
    

I want to compare those two text files. If the four columns of file 2 are present in file 1 with equal values, I want output like this:

7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt

This works well for comparing a single column between the two files. I want to compare multiple columns. Does anyone have a suggestion?


EDIT: From the OP's comments:

nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt

.. this works well for comparing a single column between the two files. I want to compare multiple columns. Do you have any suggestions?


Comments (7)

酒废 2024-09-15 11:54:08

This awk one-liner works for multiple columns on unsorted files:

awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt

In order for this to work, it is imperative that the first file used for input (file1.txt in my example) be the one that has only 4 fields, like so:

file1.txt

7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1

file2.txt

7000,john,2,0,0,1,6
7000,john,2,0,0,1,7
7000,john,2,0,0,1,8
7000,john,2,0,0,1,9
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
7003,mike,1,0,0,2,2
7003,mike,1,0,0,2,3
7003,mike,1,0,0,2,4
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7

Output

$ awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

Alternatively, you could also use the following syntax, which more closely matches the one in your question but is less readable, IMHO:

awk -F, 'NR==FNR{a[$1,$2,$3,$4];next} ($1SUBSEP$3SUBSEP$6SUBSEP$7 in a)' file1.txt file2.txt
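One caveat worth noting: both sample files in the question carry a header line, and because the header's key fields also line up (number, account id, dac acc, TDID in both), the one-liner above would print the header row as a spurious match. A small FNR > 1 guard (my addition, not part of the original answer) avoids that; the sketch below recreates the answer's sample files, headers included, to show it:

```shell
# Recreate the answer's inputs, headers included.
cat > file1.txt <<'EOF'
number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
EOF
cat > file2.txt <<'EOF'
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
EOF

# FNR > 1 skips each file's header line; without it the matching
# header keys would print file2's header as a "match".
awk -F, 'NR==FNR { if (FNR > 1) a[$1,$2,$3,$4]++; next }
         FNR > 1 && a[$1,$3,$6,$7]' file1.txt file2.txt
```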
窝囊感情。 2024-09-15 11:54:08

TxtSushi looks like what you want. It lets you work with CSV files using SQL.
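TxtSushi's own command syntax isn't shown in this answer, but the same SQL-on-CSV idea can be sketched with the widely available sqlite3 shell (a substitute tool, not TxtSushi itself). The file names and sample data are the ones from the question; on .import, the header row becomes the column names:

```shell
cat > file1.txt <<'EOF'
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
EOF
cat > file2.txt <<'EOF'
number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
EOF

# An inner join on the four key columns returns only the file1 rows
# whose key values also appear in file2.
sqlite3 :memory: <<'EOF'
.mode csv
.import file1.txt f1
.import file2.txt f2
SELECT f1.* FROM f1 JOIN f2
    ON  f1.number       = f2.number
    AND f1."account id" = f2."account id"
    AND f1."dac acc"    = f2."dac acc"
    AND f1.TDID         = f2.TDID;
EOF
```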

爱*していゐ 2024-09-15 11:54:08

It's not an elegant one-liner, but you could do it with perl.

#!/usr/bin/perl
# Note: the original version relied on perl's long-removed "implicit
# split to @_" behaviour; this one spells the split out. It also skips
# the header lines, which would otherwise compare as equal under the
# numeric == (non-numeric strings all numify to 0).
use strict;
use warnings;

my %k;
open my $fh_a, '<', $ARGV[0] or die "$ARGV[0]: $!";
while (my $line = <$fh_a>) {
    next if $. == 1;            # skip the header line
    chomp $line;
    my @f = split /,/, $line;
    $k{$f[0]} = [@f];           # index the first file by its first field
}
close $fh_a;

open my $fh_b, '<', $ARGV[1] or die "$ARGV[1]: $!";
while (my $line = <$fh_b>) {
    next if $. == 1;            # skip the header line
    chomp $line;
    my @f = split /,/, $line;
    print join(',', @{ $k{$f[0]} }), "\n"
        if defined $k{$f[0]}
        && $k{$f[0]}[2] == $f[1]
        && $k{$f[0]}[5] == $f[2]
        && $k{$f[0]}[6] == $f[3];
}
close $fh_b;
拔了角的鹿 2024-09-15 11:54:08

Quick answer: Use cut to split out the fields you need and diff to compare the results.
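One way to flesh out that idea (file names as in the question; I use grep rather than diff to recover the full matching rows, and the grep -F substring match is good enough for this sample data but could over-match in general):

```shell
cat > file1.txt <<'EOF'
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
EOF
cat > file2.txt <<'EOF'
number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
EOF

# Pair each row's projected key (fields 1,3,6,7) with the full row,
# keep the rows whose key is listed in file2 (header stripped by
# tail -n +2), then drop the key column again.
paste -d' ' <(cut -d, -f1,3,6,7 file1.txt) file1.txt |
    grep -Ff <(tail -n +2 file2.txt) |
    cut -d' ' -f2-
```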

不弃不离 2024-09-15 11:54:08

Not really well tested, but this might work:

join -t, file1 file2 | awk -F, 'BEGIN{OFS=","} {if ($3==$8 && $6==$9 && $7==$10) print $1,$2,$3,$4,$5,$6,$7}'

(Of course, this assumes the input files are sorted).
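A runnable sketch of that join approach (file names from the question; the headers are stripped before sorting, since the header lines would otherwise sort after the digit-led rows and violate join's sorted-input requirement):

```shell
cat > file1.txt <<'EOF'
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
EOF
cat > file2.txt <<'EOF'
number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
EOF

# Join on field 1; the joined line is file1's 7 fields followed by
# file2's remaining 3, so $8..$10 are file2's key columns. Keep rows
# where those agree too, and print the full file1 record.
join -t, <(tail -n +2 file1.txt | sort) <(tail -n +2 file2.txt | sort) |
    awk -F, 'BEGIN{OFS=","}
             $3==$8 && $6==$9 && $7==$10 {print $1,$2,$3,$4,$5,$6,$7}'
```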

昨迟人 2024-09-15 11:54:08

This is neither efficient nor pretty, but it will get the job done. It is not the most efficient implementation, as it parses file1 multiple times; however, it does not read the entire file into RAM either, so it has some benefits over the simple scripting approaches.

sed -n '2,$p' file1 | awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0 }' | \
sort | join file2 - |awk '{print $2}'

This works as follows:

  1. sed -n '2,$p' file1 sends file1 to STDOUT without the header line
  2. The first awk command prints the 4 "key fields" from file1 in the same format as they are in file2 followed by a space followed by the contents of file1
  3. The sort command ensures that file1 is in the same order as file2
  3. The join command joins file2 with its standard input, only writing records that have a matching record in file2
  5. The final awk command prints just the original part of file1

In order for this to work you must ensure that file2 is sorted before running the command.

Running this against your example data gave the following result

7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

EDIT

I note from your comments that you are getting a sorting error. If this error occurs when sorting file2 before running the pipeline command, you could split the file, sort each part, and then concatenate them back together.

Something like this would do that for you

mv file2 file2.orig
for i in 0 1 2 3 4 5 6 7 8 9
do
  grep "^${i}" file2.orig |sort > file2.$i
done
cat file2.[0-9] >file2
rm file2.[0-9] file2.orig

You may need to modify the variables passed to for if your file is not distributed evenly across the full range of leading digits.

说不完的你爱 2024-09-15 11:54:08

The statistical package R handles processing multiple csv tables really easily.
See An Intro. to R or R for Beginners.
