How do I compare two text files with multiple fields in Unix?
I have two text files:
file 1
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
file 2
number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
I want to compare these two text files. If the four columns of file 2 are present in file 1 and are equal, I want output like this:
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt
This works well for comparing a single column across the two files, but I want to compare multiple columns. Does anyone have a suggestion?
Answers (7)
This awk one-liner handles multiple columns on unsorted files:
awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
In order for this to work, it is imperative that the first file used as input (file1.txt in my example) be the file that has only 4 fields, like so:
file1.txt

number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
file2.txt

number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
Output

number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

(The header line is printed too, because both headers carry the same four column names and therefore match like any other key.)
Alternatively, you could also use the following syntax, which more closely matches the one in your question but is not very readable, IMHO:
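Presumably this means testing the whole field tuple with awk's multi-subscript "in" operator, which makes the command read like the nawk one-liner in the question. A sketch of that form (an assumption, not necessarily the answer's verbatim code):

nawk -F"," 'NR==FNR {a[$1,$2,$3,$4];next} (($1,$3,$6,$7) in a)' file1.txt file2.txt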
TxtSushi looks like what you want. It lets you work with CSV files using SQL.
It's not an elegant one-liner, but you could do it with Perl.
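The script itself is not shown, but the idea can be sketched as a Perl one-liner over the question's files (the %seen hash and the $second flag are my own naming, not the answer's):

perl -F, -lane '
    if (!$second) { $seen{"@F[0,1,2,3]"} = 1 }       # file2: remember each key tuple
    else          { print if $seen{"@F[0,2,5,6]"} }  # file1: print rows whose fields 1,3,6,7 match
    $second = 1 if eof;                              # flip after the last line of file2
' file2.txt file1.txt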
Quick answer: use cut to split out the fields you need, and diff to compare the results.
Not really well tested, but this might work:
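For instance, a sketch of that cut/diff combination, using the question's file names (this reports differences between the key columns rather than printing the matching rows directly):

# extract the four key fields from the 7-field file, then diff
# against file2 -- rows present in both files produce no output,
# rows unique to one side show up prefixed with < or >
cut -d, -f1,3,6,7 file1.txt | diff - file2.txt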
(Of course, this assumes the input files are sorted).
This is neither efficient nor pretty, but it will get the job done. It is not the most efficient implementation, as it parses file1 multiple times, but it does not read the entire file into RAM either, so it has some benefits over the simple scripting approaches.
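The pipeline itself is not shown; a sketch that fits this description (file1.body is an assumed temporary file, and the original command may well have differed):

sed -n '2,$p' file1 > file1.body             # file1 minus its header line
sed -n '2,$p' file2 | while IFS=, read -r num acc dac tdid; do
    # one full scan of file1.body per record of file2 --
    # slow, but nothing is ever held in memory
    awk -F, -v n="$num" -v a="$acc" -v d="$dac" -v t="$tdid" \
        '$1==n && $3==a && $6==d && $7==t' file1.body
done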
This works as follows:

sed -n '2,$p' file1

sends file1 to STDOUT without the header line.

In order for this to work, you must ensure that file2 is sorted before running the command.
Running this against your example data gave the four matching rows shown in the question.
EDIT
I note from your comments that you are getting a sorting error. If this error occurs when sorting file2 before running the pipeline command, then you could split the file, sort each part, and then cat them back together again.
Something like this would do that for you:
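A sketch of that approach, splitting the file by leading digit and collecting the sorted chunks into an assumed file2.sorted:

for i in 0 1 2 3 4 5 6 7 8 9; do
    # sort one leading-digit chunk at a time, appending each
    # sorted chunk to the combined output
    grep "^$i" file2 | sort >> file2.sorted
done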
You may need to modify the values passed to the for loop if your file is not distributed evenly across the full range of leading digits.
The statistical package R handles processing multiple CSV tables really easily.
See An Intro. to R or R for Beginners.