How to use the Unix split command to dynamically split a large CSV file into smaller 125MB–1GB CSV files

Published 2025-02-09 23:20:19

  • I am trying to split a large CSV file into smaller CSV files of 125MB to 1GB each. The split command works if we give it a number of records per file, but I want to derive that row count dynamically from the file size. Loading a whole 20GB file into a Redshift table using the COPY command takes a lot of time, so if we chunk the 20GB file into files of the sizes mentioned, I should get good results.

  • For example, with a 20GB file we could split at 6_000_000 records per file so that each chunk is around 125MB; I want that per-file row count computed dynamically, depending on the size.
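The division being asked for can be sketched with plain shell integer arithmetic. The numbers below are hypothetical, purely to illustrate the calculation (a real script would measure them from the file):

```shell
# Hypothetical figures: a 20480MB (20GB) file holding 960_000_000 rows,
# with a 125MB target chunk size.
fileSizeMB=20480
totalRows=960000000
targetMB=125

# Number of chunks, then the line count to hand to `split -l`.
chunks=$(( fileSizeMB / targetMB ))
rowsPerChunk=$(( totalRows / chunks ))
echo "$chunks chunks, $rowsPerChunk rows per chunk"
```

Integer division truncates, so the real chunk count and per-chunk size only approximate the target.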

Comments (1)

尛丟丟 2025-02-16 23:20:19

You can get the file size in MB and divide it by some ideal size that you predetermine (for my example I picked your minimum of 125MB); that gives you the number of chunks.

You then get the row count (wc -l, assuming your CSV has no line breaks inside a cell) and divide it by the number of chunks to get your rows per chunk.

Rows per chunk is the "lines per chunk" count that you can finally pass to split.

Because we are doing integer division, which will most likely leave a remainder, you'll probably get one extra file with relatively few leftover rows (you can see this in the example).

Here's how I coded this up. I'm checking it with shellcheck, so I think it's pretty POSIX compliant:
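With the figures from the run shown further down (60_000_000 rows divided into 38 chunks), the division and its leftover work out like this:

```shell
# Integer division drops the fractional part; the modulo gives the
# rows that spill into one extra, much smaller file.
echo $(( 60000000 / 38 ))   # rows per full chunk
echo $(( 60000000 % 38 ))   # leftover rows in the final file
```

Those two values, 1578947 and 14, match the per-file line counts in the sample output below.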

#!/bin/sh

csvFile=$1

# Target size per chunk, in MB
maxSizeMB=125

# Clear output from any previous run
rm -f chunked_*

# File size in MB (du -m reports 1MB blocks)
fSizeMB=$(du -ms "$csvFile" | cut -f1)
echo "File size is $fSizeMB, max size per new file is $maxSizeMB"

# Number of chunks; guard against zero for files smaller than maxSizeMB
nChunks=$(( fSizeMB / maxSizeMB ))
[ "$nChunks" -gt 0 ] || nChunks=1
echo "Want $nChunks chunks"

# Reading from stdin makes wc print only the count, avoiding a fragile
# cut on its platform-dependent padding; tr strips any leading spaces
nRows=$(wc -l < "$csvFile" | tr -d ' ')
echo "File row count is $nRows"

nRowsPerChunk=$(( nRows / nChunks ))
echo "Need $nChunks files at around $nRowsPerChunk rows per file (plus one more file, maybe, for remainder)"

# -d: numeric suffixes, -a 4: four-digit suffixes, -l: lines per file
split -d -a 4 -l "$nRowsPerChunk" "$csvFile" "chunked_"

echo "Row (line) counts per file:"
wc -l chunked_00*

echo
echo "Size (MB) per file:"
du -ms chunked_00*

I created a mock CSV with 60_000_000 rows that is about 5GB:

ll -h gen_60000000x11.csv
-rw-r--r--  1 zyoung  staff   4.7G Jun 24 15:21 gen_60000000x11.csv

When I ran that script I got this output:

./main.sh gen_60000000x11.csv
File size is 4801MB, max size per new file is 125MB
Want 38 chunks
File row count is 60000000
Need 38 files at around 1578947 rows per file (plus one more file, maybe, for remainder)
Row (line) counts per file:
 1578947 chunked_0000
 1578947 chunked_0001
 1578947 chunked_0002
 ...
 1578947 chunked_0036
 1578947 chunked_0037
      14 chunked_0038
 60000000 total

Size (MB) per file:
129     chunked_0000
129     chunked_0001
129     chunked_0002
...
129     chunked_0036
129     chunked_0037
1       chunked_0038
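As an aside: if GNU coreutils is available, split can do the sizing itself via -C/--line-bytes, skipping the row-count arithmetic entirely. Note the assumptions here: -C is a GNU extension rather than POSIX, and the demo generates its own small input (seq output standing in for a real CSV) with a 1m limit so it runs quickly; for the real 20GB file you would point it at your CSV with -C 125m.

```shell
# Demo input: 200_000 short rows (a stand-in for the real CSV file).
seq 1 200000 > big.csv

# -C packs as many complete lines as fit into each output file without
# exceeding the byte limit, so no line is ever cut in half.
split -d -a 4 -C 1m big.csv chunked_demo_

wc -l chunked_demo_*
```

Unlike the row-count approach, every chunk is guaranteed to stay under the size limit even if line lengths vary across the file.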