How to quickly filter a huge CSV file in Bash using a list of IDs from another file?
I have two CSV files.
The first one is just a list of IDs (around 300k rows). Let's call it ids.csv:
id
1
3
7
...
The second one is a list of objects with IDs (around 2 million rows). Let's call it data.csv:
id,date,state
1,2022-01-01,true
4,2022-01-03,false
...
I would like to build a third CSV file containing the rows from data.csv whose IDs appear in ids.csv. Not every ID from ids.csv will necessarily be present in data.csv.
I tried something like this:
while IFS=, read -r line
do
    awk -F',' -v id="$line" '$1 == id' data.csv >> out.csv
done < ids.csv
It works, but the execution takes something like forever. I tried to split it into several parallel streams, but the script itself is so slow that it did not help much.
Can you suggest a better or more optimal way to filter data.csv faster?
3 Answers
Probably the most common answer on this forum in one way or another:
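A minimal sketch of that standard approach, assuming the comma-separated layouts shown in the question (the column names and output file are just the ones used above):

# pass 1 (ids.csv): remember every id in an array
# pass 2 (data.csv): print only the rows whose first field is one of those ids
awk -F',' 'NR==FNR { ids[$1]; next } $1 in ids' ids.csv data.csv > out.csv

It reads each file exactly once, instead of rescanning data.csv for every single ID.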
That will be an order of magnitude faster than your existing script.
Regarding why your original script is slow, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
Using csvgrep from the handy csvkit bundle of utilities: -f filename is like plain grep's -f argument; if any of the lines in it are a match for the given column name, that record is printed.
You can rework the standard de-duping code for this:
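A sketch of that rework, assuming the classic awk '!seen[$0]++' de-dupe idiom as the starting point:

# classic de-dupe:  awk '!seen[$0]++' file
# reworked: first mark every id from ids.csv as seen, then print only the rows
# of data.csv whose first field has been marked
awk -F',' 'NR==FNR { seen[$1]++; next } seen[$1]' ids.csv data.csv > out.csv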