Dividing a large list into chunks with convenient I/O
I have a large list with a size of approx. 1.3GB. I'm looking for the fastest solution in R to generate chunks and save them in any convenient format so that:
a) every saved chunk file is smaller than 100MB
b) the original list can be loaded conveniently and quickly into a new R workspace
EDIT II: The reason for doing so is an R solution to bypass the GitHub file size restriction of 100MB per file. The limitation to R is due to some external non-technical restrictions which I can't comment on.
What is the best solution for this problem?
EDIT I: Since it was mentioned in the comments that some code for the problem helps to create a better question, here is an R example of a list with a size of 1.3 GB:
li <- list(a = rnorm(10^8),
           b = rnorm(10^7.8))
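For reference, the in-memory size of such a list can be checked directly (object.size is part of base R's utils package):

print(object.size(li), units = "GB")  # roughly 1.3e9 bytes, i.e. about 1.2 GB in base-1024 units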
1 Answer:
So, you want to split a file and reload it into a single dataframe.
There is a twist: to reduce file size, it would be wise to compress, but then the file size is not entirely deterministic. You may have to tweak a parameter.
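A quick illustration of this point (the file names here are hypothetical): the same number of rows can compress to very different sizes depending on the content.

# Two files with identical dimensions but very different compressed sizes:
saveRDS(data.frame(x = rnorm(1e5)), "a.rds", compress = "xz")   # random doubles compress poorly
saveRDS(data.frame(x = rep(1, 1e5)), "b.rds", compress = "xz")  # a constant column compresses very well
file.size("a.rds")
file.size("b.rds")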
The following is a piece of code I have used for a similar task (unrelated to GitHub though).
The split.file function takes 3 arguments: a dataframe, the number of rows to write in each file, and the base filename. For instance, if the basename is "myfile", the files will be "myfile00001.rds", "myfile00002.rds", etc. The function returns the number of files written.
The join.files function takes the base name.
Note: you may have to experiment with the rows parameter to find the correct size to fit in 100 MB. It depends on your data, but for similar datasets a fixed size should do. However, if you are dealing with very different datasets, this approach will likely fail.
Here are the functions:
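A minimal sketch of the two functions, assuming xz-compressed .rds chunks that are reassembled with rbind (the %05d naming matches the "myfile00001.rds" scheme described above):

split.file <- function(df, rows, basename) {
  n <- nrow(df)
  # indices of each chunk: 1..rows, rows+1..2*rows, ...
  chunks <- split(seq_len(n), ceiling(seq_len(n) / rows))
  for (k in seq_along(chunks)) {
    saveRDS(df[chunks[[k]], , drop = FALSE],
            file = sprintf("%s%05d.rds", basename, k),
            compress = "xz")  # "xz" compresses best; "gzip" is faster
  }
  length(chunks)  # number of files written
}

join.files <- function(basename) {
  # read the chunk files back in order and stack them into one dataframe
  files <- sort(list.files(pattern = sprintf("^%s[0-9]{5}\\.rds$", basename)))
  do.call(rbind, lapply(files, readRDS))
}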
Example:
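A hedged usage sketch; the dataframe and chunk size here are illustrative only:

df <- data.frame(x = rnorm(10^6))
split.file(df, rows = 250000, basename = "myfile")  # writes myfile00001.rds .. myfile00004.rds
df2 <- join.files("myfile")
nrow(df2) == nrow(df)    # TRUE
all.equal(df$x, df2$x)   # TRUE: the data round-trips

To confirm the chunks respect the 100MB limit, check that all(file.size(list.files(pattern = "^myfile[0-9]{5}\\.rds$")) < 100 * 1024^2) holds, and raise or lower rows accordingly.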