How to dynamically split a large CSV file into smaller 125MB–1000MB CSV files using the Unix split command
I am trying to split a large CSV file into smaller CSV files of 125MB to 1GB each. The split command works if we give it a number of records per file, but I want to derive that row count dynamically from the file size. If the file is 20GB and I load the whole file into a Redshift table using the COPY command, it takes a lot of time; if we chunk the 20GB file into files of the sizes mentioned, I will get much better results.

For example, with a 20GB file we can split at 6,000,000 records per file, so each chunk file will be around 125MB. I want that 6,000,000 row count to be computed dynamically depending on the file size.
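For reference, the static usage described above looks like this. This is a scaled-down, hypothetical demo (file name, tiny row counts, and chunk size are all stand-ins; a real run would use something like `-l 6000000`):

```shell
#!/bin/sh
# Scaled-down demo of the static approach: a fixed line count per chunk.
printf 'id,value\n' > demo.csv
seq 1 10 | sed 's/.*/&,row_&/' >> demo.csv   # 1 header + 10 data rows

# 11 lines split 4 at a time -> chunk_aa (4), chunk_ab (4), chunk_ac (3).
# Note the header line lands only in the first chunk.
split -l 4 demo.csv chunk_
wc -l chunk_*
```

The problem with this form is that the `-l` value is hard-coded; the answer below derives it from the file size instead.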
You can get the file size in MB and divide it by some ideal chunk size that you predetermine (for my example I picked your minimum of 125MB); that gives you the number of chunks.

You then get the row count (`wc -l`, assuming your CSV has no line breaks inside a cell) and divide it by the number of chunks to get the rows per chunk. That "lines per chunk" count is what you finally pass to `split`. Because we are doing integer division, which will most likely leave a remainder, you'll probably get one extra file holding the relatively few leftover rows (you can see this in the example).
Here's how I coded this up. I ran it through ShellCheck, so I think it's pretty POSIX-compliant:
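The answerer's original script isn't shown here; the following is a minimal sketch of the steps just described. The file name, the 125MB target, and the tiny generated stand-in input are assumptions so the sketch runs as-is:

```shell
#!/bin/sh
# Sketch of the approach: derive split's -l value from the file size.
set -eu

file="sample.csv"                       # stand-in for a real, much larger CSV
seq 1 1000 | sed 's/.*/&,field_&/' > "$file"

target_mb=125                           # desired chunk size in MB

# 1. File size in MB, from a byte count (POSIX wc -c).
size_mb=$(( $(wc -c < "$file") / 1024 / 1024 ))

# 2. Number of chunks; clamp to 1 so files under the target still work.
chunks=$(( size_mb / target_mb ))
if [ "$chunks" -lt 1 ]; then chunks=1; fi

# 3. Total row count (assumes no newlines embedded inside quoted cells).
rows=$(wc -l < "$file")

# 4. Rows per chunk; the integer-division remainder means split may emit
#    one extra, smaller trailing file.
rows_per_chunk=$(( rows / chunks ))

echo "size=${size_mb}MB chunks=${chunks} rows=${rows} rows_per_chunk=${rows_per_chunk}"
split -l "$rows_per_chunk" "$file" "${file%.csv}_part_"
```

On the tiny stand-in file the size rounds down to 0MB, so the clamp kicks in and a single part file is produced; on a 20GB input with a 125MB target you would get 160 chunks.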
I created a mock CSV with 60,000,000 rows that is about 5GB:
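The generator itself isn't shown here; one way to produce such a mock CSV is sketched below. The column layout is an assumption, and the row count is scaled down (the answer used 60,000,000 rows, about 5GB):

```shell
#!/bin/sh
# Generate a mock CSV with a synthetic three-column layout.
rows=100000    # scaled down from the answer's 60,000,000
awk -v n="$rows" 'BEGIN {
    for (i = 1; i <= n; i++)
        printf "%d,name_%d,value_%d\n", i, i, i
}' > mock.csv
```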
When I ran that script I got this output: