如何使用BASH从另一个文件中的ID列表中快速过滤巨大的CSV文件？

发布于 2025-02-13 03:14:25 字数 530 浏览 2 评论 0原文

我有2个CSV文件。第一个只是ID列表（约300k行）。避免将其称为ids.csv

id
1
3
7
...

第二个是具有ID的对象列表（约为2000万行）。让我们称其为data.csv，

id,date,state
1,2022-01-01,true
4,2022-01-03,false
...

我想构建带有来自IDS.CSV ID的data.csv行的第三CSV文件。 data.csv中没有必要所有IDS.CSV的ID。

我尝试了这样的事情：

while IFS=, read -r line
do
    awk -F',' -v id='$line' '$1==id' data.csv >> out.csv
done < ids.csv

它正在工作，但是执行需要永远。我试图将其分为几个流，但是该脚本本身的工作缓慢，因此它没有帮助Mutch。

您能建议我更好或最佳的方法如何更快地过滤该数据。

原文

I have 2 csv files.
The first one is just a list of IDs (around 300k rows). Lest call it ids.csv

id
1
3
7
...

The second one is list of objects with id (around 2 milions rows). Lets call it data.csv

id,date,state
1,2022-01-01,true
4,2022-01-03,false
...

I would like to build the 3rd csv file with rows from data.csv which having IDs from ids.csv. It is not necessary that all of IDs from ids.csv will be present in data.csv.

I tried something like this:

while IFS=, read -r line
do
    awk -F',' -v id='$line' '$1==id' data.csv >> out.csv
done < ids.csv

It is working, but executions takes something like forever. I tried to split it for several streams, but the script is working slow by itself so it was not helping mutch.

Can you suggest me better or more optimal way how to filter that data.csv faster?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

亽野灬性zι浪 2025-02-20 03:14:25

可能以一种或另一种方式上最常见的答案是：

awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv

这将比您现有的脚本快。

关于为什么您的原始脚本很慢，请参见 Why-is-Is-us-a-shell-shell-loop-to-to-tox-text-text-text-the-tossed-bad-practice 。

Probably the most common answer on this forum in one way or another:

awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv

That will be an order of magnitude faster than your existing script.

Regarding why your original script is slow, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

回复收藏 0 原文

明媚如初 2025-02-20 03:14:25

使用CSVGREP来自Handy csvkit cosvkit bun bundle code：

$ csvgrep -c id -f ids.csv data.csv
id,date,state
1,2022-01-01,true

code > -f文件名就像普通的grep's -f参数;如果其中的任何行是给定列名称的匹配，则该记录是打印的。

Using csvgrep from the handy csvkit bundle of utilities:

$ csvgrep -c id -f ids.csv data.csv
id,date,state
1,2022-01-01,true

-f filename is like plain grep's -f argument; if any of the lines in it are a match for the given column name, that record is printed.

回复收藏 0 原文

嘦怹 2025-02-20 03:14:25

您可以为此重新设计标准删除代码：

 {m,g}awk 'NR==FNR ?!--__[$_]:__[$!_]' FS=',' ids.csv data.csv
 or
 {m,g}awk -F',' '(_=FNR==NR)<(__[$!_]+=_)' ids.csv data.csv

 1  id,date,state
 2  1,2022-01-01,true

you can rework the standard de-duping code for this :

 {m,g}awk 'NR==FNR ?!--__[$_]:__[$!_]' FS=',' ids.csv data.csv
 or
 {m,g}awk -F',' '(_=FNR==NR)<(__[$!_]+=_)' ids.csv data.csv

 1  id,date,state
 2  1,2022-01-01,true

回复收藏 0 原文

~没有更多了~