Looking to automate the processing of very large .csv files

Posted 2025-01-16 10:49:19

Here is my dilemma. I need to process information from large .csv files. The typical size of a file is about 200 MB, which results in approximately 3-4 million lines.
The files contain human-readable information in 5 fields.
My process originally was to take the file and sort it based on the last field:

"sort -k4 -n -t, "

then split it:

"split -l 1000000"

However, now I need to add a base64 decode of the second field.

I used this script for testing:

awk 'BEGIN{FS=OFS=","} {cmd="echo "$2" | base64 --decode"; cmd | getline v; $2=v} 1' originalfilename.csv > newfilename.csv

While running this command on a sample file, I get an error message,

"sh: fork: Resource temporarily unavailable"

and the process halts.

My final output is several .csv files that I import into Excel tabs for processing.

Here are my questions:

  1. Why am I getting the error while running awk? Is there a different command that will do what I need faster and cleaner?

  2. Can anyone suggest a better way to process my files? I'm trying to stay away from Java, but a Python script may work (a rough sketch of what I was imagining is after this list).
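
For what it's worth, here is roughly what I was imagining in Python for the new base64 step, with the existing sort/split left as they are. This is only a sketch, untested on the real data, and the file names are placeholders:

#!/usr/bin/env python3
# Decode the base64 value in the second field of every row and write a new CSV.
# Usage: python3 decode_col2.py originalfilename.csv newfilename.csv
import base64
import csv
import sys

with open(sys.argv[1], newline="") as src, open(sys.argv[2], "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # The second field holds the base64 text; replace it with the decoded string.
        row[1] = base64.b64decode(row[1]).decode("utf-8", errors="replace")
        writer.writerow(row)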

I am not a professional programmer, hence my knowledge of tools is not up to date.

Any advice/assistance would be greatly appreciated.

Thanks

Comments (2)

浅浅淡淡 2025-01-23 10:49:19

If you can install a program, I recommend GoCSV; it's a purpose-built multi-tool for CSV with many functions for manipulating data, including decoding base64, and it's pre-built for a number of platforms.

I mocked up a sample CSV:

Col1,Col2,Col3,Col4,Col5
banana,U3RhY2tPdmVyZmxvdw==,apple,0,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,1,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,2,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,3,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,4,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,5,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,6,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,7,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,8,turnip
banana,U3RhY2tPdmVyZmxvdw==,apple,9,turnip

We can build up to a pipeline for:

  • base64-decoding the 2nd column (which actually has to add a whole new column)
  • selecting that new column and putting it back in its place (in essence, overriding the original base64-encoded column)
  • renaming the new column to the old column
  • sorting on the 4th column
  • splitting every 1_000_000 rows

I'll start with the base64 decoding, because to do that we'll actually need to add a new column:

gocsv add -name New_col -t '{{ b64dec .Col2 }}' small.csv
Col1,Col2,Col3,Col4,Col5,New_col
banana,U3RhY2tPdmVyZmxvdw==,apple,0,turnip,StackOverflow
banana,U3RhY2tPdmVyZmxvdw==,apple,1,turnip,StackOverflow
...

We can then use the select command to effectively replace the original column with the new column:

...
| gocsv select -c 1,New_col,3-5
Col1,New_col,Col3,Col4,Col5
banana,StackOverflow,apple,0,turnip
banana,StackOverflow,apple,1,turnip

and then rename the new column back to the old name:

...
| gocsv rename -c New_col -names Col2
Col1,Col2,Col3,Col4,Col5
banana,StackOverflow,apple,0,turnip
banana,StackOverflow,apple,1,turnip

From there, add in sorting based on column 4, and split into CSVs of no more than 1_000_000 rows each:

...
| gocsv sort -c 4 -reverse \
| gocsv split -max-rows 1_000_000

(My sort is reversed because column 4 was already in ascending order.)

Here's the whole pipeline:

gocsv add -name New_col -t '{{ b64dec .Col2 }}' big.csv \
| gocsv select -c 1,New_col,3-5 \
| gocsv rename -c New_col -names Col2 \
| gocsv sort -c 4 -reverse \
| gocsv split -max-rows 1_000_000

That small 10-row mock-up is from a big mock-up of your data I made:

% gocsv dims big.csv 
Dimensions:
  Rows: 4000000
  Columns: 5

When I run the whole pipeline on the big CSV:

% time ./main.sh
./main.sh  21.10s user 13.55s system 165% cpu 20.951 total

% ls out*.csv
out-1.csv       out-2.csv       out-3.csv       out-4.csv

% gocsv dims out-1.csv 
Dimensions:
  Rows: 1000000
  Columns: 5

% gocsv head -n 2 out-1.csv
Col1,Col2,Col3,Col4,Col5
banana,StackOverflow,apple,3999999,turnip
banana,StackOverflow,apple,3999998,turnip

% gocsv tail -n 2 out-4.csv
Col1,Col2,Col3,Col4,Col5
banana,StackOverflow,apple,1,turnip
banana,StackOverflow,apple,0,turnip
心病无药医 2025-01-23 10:49:19

DuckDB is a great tool; you can import all the CSV files at once with the following:

CREATE TABLE newtable AS SELECT Col1, from_base64(Col2)::VARCHAR AS Col2, Col3, Col4, Col5 FROM read_csv_auto('file*.csv');

and then export the result with "partitioned writes" based on the row number.
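
For example, a rough sketch of what that partitioned export could look like (the chunk bucket column and the out directory are made up, and option names may differ slightly between DuckDB versions):

COPY (
    -- Number the rows in Col4 order, then bucket them into groups of 1,000,000.
    SELECT *, (row_number() OVER (ORDER BY Col4) - 1) // 1000000 AS chunk
    FROM newtable
) TO 'out' (FORMAT CSV, PARTITION_BY (chunk));

DuckDB then writes one subdirectory per chunk value (out/chunk=0, out/chunk=1, ...), each holding at most 1,000,000 data rows.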

Since you want to split your files at 1 million lines because of a 17-year-old Excel limitation (in place since 2007), you might also want to check out VisiData to efficiently open big files (CSV/Excel/...).
