How to use the Unix split command to dynamically split a large CSV file into smaller 125MB–1GB CSV files

Published 2025-02-09 23:20:19

  • I am trying to split a large CSV file into smaller CSV files of 125MB to 1GB each. The split command works if we give it a number of records per file, but I want to derive that row count dynamically from the file size. Loading a whole 20GB file into a Redshift table using the COPY command takes a lot of time, so if we chunk the 20GB file into files of the sizes mentioned, I should get good results.

  • For example, with a 20GB file we could split at 6_000_000 records per file so that each chunk is around 125MB; I want that per-file row count computed dynamically, depending on the size.
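The division being asked for can be sketched with plain shell integer arithmetic. The numbers below are hypothetical, purely to illustrate the calculation (a real script would measure them from the file):

```shell
# Hypothetical figures: a 20480MB (20GB) file holding 960_000_000 rows,
# with a 125MB target chunk size.
fileSizeMB=20480
totalRows=960000000
targetMB=125

# Number of chunks, then the line count to hand to `split -l`.
chunks=$(( fileSizeMB / targetMB ))
rowsPerChunk=$(( totalRows / chunks ))
echo "$chunks chunks, $rowsPerChunk rows per chunk"
```

Integer division truncates, so the real chunk count and per-chunk size only approximate the target.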

Comments (1)

尛丟丟 2025-02-16 23:20:19

You can get the file size in MB and divide it by some ideal size that you predetermine (for my example I picked your minimum of 125MB); that gives you the number of chunks.

You then get the row count (wc -l, assuming your CSV has no line breaks inside a cell) and divide it by the number of chunks to get your rows per chunk.

Rows per chunk is the "lines per chunk" count that you can finally pass to split.

Because we are doing integer division, which will most likely leave a remainder, you'll probably get one extra file with relatively few leftover rows (you can see this in the example).

Here's how I coded this up. I'm checking it with shellcheck, so I think it's pretty POSIX compliant:
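With the figures from the run shown further down (60_000_000 rows divided into 38 chunks), the division and its leftover work out like this:

```shell
# Integer division drops the fractional part; the modulo gives the
# rows that spill into one extra, much smaller file.
echo $(( 60000000 / 38 ))   # rows per full chunk
echo $(( 60000000 % 38 ))   # leftover rows in the final file
```

Those two values, 1578947 and 14, match the per-file line counts in the sample output below.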

#!/bin/sh

csvFile=$1

# Target size per chunk, in MB
maxSizeMB=125

# Clear output from any previous run
rm -f chunked_*

# File size in MB (du -m reports 1MB blocks)
fSizeMB=$(du -ms "$csvFile" | cut -f1)
echo "File size is $fSizeMB, max size per new file is $maxSizeMB"

# Number of chunks; guard against zero for files smaller than maxSizeMB
nChunks=$(( fSizeMB / maxSizeMB ))
[ "$nChunks" -gt 0 ] || nChunks=1
echo "Want $nChunks chunks"

# Reading from stdin makes wc print only the count, avoiding a fragile
# cut on its platform-dependent padding; tr strips any leading spaces
nRows=$(wc -l < "$csvFile" | tr -d ' ')
echo "File row count is $nRows"

nRowsPerChunk=$(( nRows / nChunks ))
echo "Need $nChunks files at around $nRowsPerChunk rows per file (plus one more file, maybe, for remainder)"

# -d: numeric suffixes, -a 4: four-digit suffixes, -l: lines per file
split -d -a 4 -l "$nRowsPerChunk" "$csvFile" "chunked_"

echo "Row (line) counts per file:"
wc -l chunked_00*

echo
echo "Size (MB) per file:"
du -ms chunked_00*

I created a mock CSV with 60_000_000 rows that is about 5GB:

ll -h gen_60000000x11.csv
-rw-r--r--  1 zyoung  staff   4.7G Jun 24 15:21 gen_60000000x11.csv

When I ran that script I got this output:

./main.sh gen_60000000x11.csv
File size is 4801MB, max size per new file is 125MB
Want 38 chunks
File row count is 60000000
Need 38 files at around 1578947 rows per file (plus one more file, maybe, for remainder)
Row (line) counts per file:
 1578947 chunked_0000
 1578947 chunked_0001
 1578947 chunked_0002
 ...
 1578947 chunked_0036
 1578947 chunked_0037
      14 chunked_0038
 60000000 total

Size (MB) per file:
129     chunked_0000
129     chunked_0001
129     chunked_0002
...
129     chunked_0036
129     chunked_0037
1       chunked_0038
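As an aside: if GNU coreutils is available, split can do the sizing itself via -C/--line-bytes, skipping the row-count arithmetic entirely. Note the assumptions here: -C is a GNU extension rather than POSIX, and the demo generates its own small input (seq output standing in for a real CSV) with a 1m limit so it runs quickly; for the real 20GB file you would point it at your CSV with -C 125m.

```shell
# Demo input: 200_000 short rows (a stand-in for the real CSV file).
seq 1 200000 > big.csv

# -C packs as many complete lines as fit into each output file without
# exceeding the byte limit, so no line is ever cut in half.
split -d -a 4 -C 1m big.csv chunked_demo_

wc -l chunked_demo_*
```

Unlike the row-count approach, every chunk is guaranteed to stay under the size limit even if line lengths vary across the file.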