How to load a big csv file with mixed-type columns using the bigmemory package
Is there a way to combine scan() and read.big.matrix() from the bigmemory package to read in a 200 MB .csv file with mixed-type columns, so that the result is a data frame with integer, character, and numeric columns?
Comments (4)
Try the ff package for this.
For 200 MB it should be a simple task.
(For much bigger files you will likely need to investigate some of the configuration options, depending on your machine and OS.)
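A minimal sketch of the ff approach, assuming the ff package is installed. The temporary file and its three columns are hypothetical stand-ins for the real 200 MB file. ff has no character vector type, so the text column is read as a factor here.

```r
library(ff)  # assumes the ff package is installed

# Hypothetical sample file standing in for the 200 MB csv.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, name = c("a", "b", "c"), value = c(1.5, 2.5, 3.5)),
  csv_path, row.names = FALSE
)

# read.csv.ffdf() reads the file in chunks into a disk-backed ffdf.
# Text columns are declared as factors because ff cannot store character vectors.
x <- read.csv.ffdf(file = csv_path, header = TRUE,
                   colClasses = c("integer", "factor", "numeric"))

dim(x)     # number of rows and columns
x[1:2, ]   # row indexing returns an ordinary data.frame
```

The result is an ffdf rather than a plain data frame, but subsets pulled out with `[` behave like data frames.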
Ah, there are some things that are impossible in this life, and some that are misunderstood and lead to unpleasant situations. @Roman is right: a matrix must be of one atomic type. It's not a data frame.
Since a matrix must be of one type, attempting to snooker bigmemory into handling multiple types is, in itself, a bad thing. Could it be done? I'm not going there. Why? Because everything else will assume that it's getting a matrix, not a data frame. That will lead to more questions and more sorrow.
Now, what you can do is identify the type of each column and generate a set of distinct bigmemory files, each containing the items of a particular type. E.g. charBM = character big matrix, intBM = integer big matrix, and so on. Then you may be able to develop a wrapper that produces a data frame out of all of this. Still, I don't recommend that: treat the different items as what they are, or coerce homogeneity if you can, rather than try to produce a big data frame griffin.
@mdsumner is correct in suggesting ff. Another storage option is HDF5, which you can access through ncdf4 in R. Unfortunately, these other packages are not as pleasant as bigmemory.
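The per-type split described above could be sketched roughly as follows, assuming bigmemory is installed. The data frame, the `rows_as_df()` wrapper, and the column names are hypothetical. Because bigmemory's matrix types are all numeric, this sketch stores the text column as integer codes plus a separate lookup vector; that encoding is an assumption of the sketch, not something the answer specifies.

```r
library(bigmemory)  # assumes bigmemory is installed

# Hypothetical small data frame standing in for the csv contents.
df <- data.frame(id = 1:3, name = c("a", "b", "a"), value = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)

# Identify the type of each column.
col_type <- vapply(df, function(col) class(col)[1], character(1))

# One big.matrix per type; each holds only the columns of that type.
intBM <- as.big.matrix(as.matrix(df[col_type == "integer"]), type = "integer")
dblBM <- as.big.matrix(as.matrix(df[col_type == "numeric"]), type = "double")

# Text column: integer codes in a big.matrix, labels kept in a plain vector.
name_levels <- sort(unique(df$name))
chrBM <- as.big.matrix(
  matrix(match(df$name, name_levels), ncol = 1, dimnames = list(NULL, "name")),
  type = "integer"
)

# Hypothetical wrapper: rebuild an ordinary data frame for a set of rows.
rows_as_df <- function(i) {
  data.frame(id = intBM[i, 1], name = name_levels[chrBM[i, 1]],
             value = dblBM[i, 1], stringsAsFactors = FALSE)
}

rows_as_df(1:2)
```

As the answer warns, anything downstream still sees three separate matrices, so the wrapper only helps when a small slice is needed as a data frame.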
According to the help file, no.
I'm not familiar with this package/function, but in R a matrix can have only one atomic type (unlike, e.g., a data.frame).
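A short base R illustration of that difference; the column names are made up for the example.

```r
# A matrix has a single type: mixing integers and strings coerces
# every element to character.
m <- cbind(id = 1:2, name = c("a", "b"))
typeof(m)          # "character"

# A data frame keeps a separate type for each column.
d <- data.frame(id = 1:2, name = c("a", "b"), value = c(1.5, 2.5),
                stringsAsFactors = FALSE)
sapply(d, class)   # id: "integer", name: "character", value: "numeric"
```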
The best solution is to read the file line by line and parse it; that way the reading process uses an almost linear amount of memory.
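One way this could look in base R is sketched below. The sample file, the column types, and the chunk size of 2 are all assumptions for illustration; setting `n = 1` in `readLines()` would give strictly line-by-line reading.

```r
# Hypothetical sample file standing in for the large csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,value", "1,a,1.5", "2,b,2.5", "3,c,3.5"), csv_path)

con <- file(csv_path, open = "r")
header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]

chunks <- list()
repeat {
  lines <- readLines(con, n = 2)      # read the next small batch of lines
  if (length(lines) == 0) break       # end of file reached
  # Parse only this batch, with explicit per-column types.
  chunks[[length(chunks) + 1]] <- read.csv(
    text = lines, header = FALSE, col.names = header,
    colClasses = c("integer", "character", "numeric")
  )
}
close(con)

result <- do.call(rbind, chunks)      # combine the parsed batches
str(result)
```

Only one batch of raw text is held at a time, so the parser never needs the whole file as text in memory, although the combined `result` still grows with the number of rows kept.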