Looking to automate the process of handling very large .csv files
Here is my dilemma. I need to process information from large .csv files. The typical file is about 200 MB, which comes to approximately 3-4 million lines.
The files contain human-readable information, 5 fields per line.
My process originally was to take the file and sort it based on the last field:
"sort -k4 -n -t, "
then split it:
"split -l 1000000"
However, now I need to add a base64 decode of the second field.
I used this script for testing:
awk 'BEGIN{FS=OFS=","} {cmd="echo "$2" | base64 --decode"; cmd | getline v; $2=v} 1' originalfilename.csv > newfilename.csv
-- while running this command on a sample file, I am getting an error message:
"sh: fork: Resource temporarily unavailable"
and the process halts.
My final output is several .csv files that I import into Excel tabs for processing.
Here are my questions:
Why am I getting the error while running awk? Is there a different command that will do what I need faster and more cleanly?
Can anyone suggest a better way to process my files? I'm trying to stay away from Java, but a Python script may work.
I am not a professional programmer, hence my knowledge of tools is not up to date.
Any advice/assistance would be greatly appreciated.
Thanks
Comments (2)
If you can install a program, I recommend GoCSV; it's a purpose-built multi-tool for CSV, has many functions for manipulating data, including decoding base64, and it's pre-built for a number of platforms.
I mocked up a sample CSV:
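Something like the following; the header and values are placeholders of my own for the 5 fields, with base64 text in the second column and an ascending fourth column:

id,payload,name,count,extra
1,aGVsbG8gd29ybGQ=,alpha,3,x
2,Zm9vIGJhciBiYXo=,beta,7,y
3,bG9yZW0gaXBzdW0=,gamma,12,z
4,aGVsbG8gd29ybGQ=,delta,19,x
5,Zm9vIGJhciBiYXo=,epsilon,25,y
6,bG9yZW0gaXBzdW0=,zeta,31,z
7,aGVsbG8gd29ybGQ=,eta,44,x
8,Zm9vIGJhciBiYXo=,theta,58,y
9,bG9yZW0gaXBzdW0=,iota,67,z
10,aGVsbG8gd29ybGQ=,kappa,90,x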
We can build up to a pipeline that decodes the base64 in the second column, sorts on the fourth column, and splits the result into files of at most 1,000,000 rows each.
I'll start with the base64 decoding, because to do that we'll actually need to add a new column:
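A rough sketch of that step, assuming GoCSV exposes an add subcommand with --name and --template flags plus a base64-decoding template function along the lines of b64dec (check gocsv help for the exact names; the file and column names follow the placeholder sample above):

# add a new column holding the decoded value of the (placeholder) payload column
gocsv add --name payload_decoded --template '{{ b64dec .payload }}' original.csv > with_decoded.csv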
We can then use the select command to effectively replace the original column with the new one, and then rename the new column back to the old name:
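Assuming select takes a --columns list and rename takes matching --columns/--names flags (again, placeholder column names), these two steps might look like:

# keep every column except the original payload, then give the decoded column its old name
gocsv select --columns id,payload_decoded,name,count,extra with_decoded.csv \
  | gocsv rename --columns payload_decoded --names payload > decoded.csv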
From there, add in sorting, based on column 4, and split into CSVs, each no bigger than 1_000_000 rows:
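A sketch of those two steps, assuming sort takes --columns/--numeric/--reverse and split takes --max-rows (verify against gocsv help):

# numeric, descending sort on column 4, then split into files of at most 1,000,000 rows
gocsv sort --columns count --numeric --reverse decoded.csv | gocsv split --max-rows 1000000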
(My sort is reversed because column 4 was already in ascending order.)
Here's the whole pipeline:
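Putting the sketched steps above together, with the same caveats about the exact GoCSV flag and template-function names:

gocsv add --name payload_decoded --template '{{ b64dec .payload }}' originalfilename.csv \
  | gocsv select --columns id,payload_decoded,name,count,extra \
  | gocsv rename --columns payload_decoded --names payload \
  | gocsv sort --columns count --numeric --reverse \
  | gocsv split --max-rows 1000000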
That small 10-row mock-up is from a big mock-up of your data I made:
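One hypothetical way to build such a mock-up (my own sketch, not the original script): encode a couple of payloads once up front so the loop does not spawn a process per row, then emit about 3.5 million lines with an ascending fourth column:

printf 'id,payload,name,count,extra\n' > big-mockup.csv
e1=$(printf 'hello world' | base64)
e2=$(printf 'foo bar baz' | base64)
awk -v e1="$e1" -v e2="$e2" 'BEGIN {
    # two pre-encoded payloads alternate in column 2; column 4 ascends
    for (i = 1; i <= 3500000; i++)
        printf "%d,%s,name%d,%d,extra%d\n", i, (i % 2 ? e1 : e2), i, i * 3, i
}' >> big-mockup.csv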
When I run the whole pipeline on the big CSV:
DuckDB is a great tool; it can import all of the CSV files at once with the following,
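For instance, a sketch of the import (read_csv_auto is DuckDB's CSV reader and accepts a glob; the database name is a placeholder):

duckdb mydata.duckdb -c "CREATE TABLE data AS SELECT * FROM read_csv_auto('*.csv');"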
and then export them with "partitioned writes" based on the row number.
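A sketch of that export, reusing the table above and a placeholder sort column col4; COPY ... TO with PARTITION_BY writes one subdirectory per computed part value, each covering at most 1,000,000 rows:

duckdb mydata.duckdb -c "
COPY (
    SELECT *,
           CAST(floor((row_number() OVER (ORDER BY col4) - 1) / 1000000) AS INTEGER) AS part
    FROM data
) TO 'out' (FORMAT CSV, PARTITION_BY (part));
"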
Since you want to split your files at 1 million lines because of a 17-year-old Excel limitation (the 1,048,576-row limit, in place since 2007), you might also want to check out VisiData to efficiently open big files (CSV/Excel/...).