How to load a large csv file with mixed-type columns using the bigmemory package

Posted 2024-11-28 04:37:16


Is there a way to combine the use of scan() and read.big.matrix() from the bigmemory package to read in a 200 MB .csv file with mixed-type columns so that the result is a dataframe with integer, character, and numeric columns?
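For context, `scan()` on its own can already read mixed-type columns when its `what` argument is a list template, one element per column. A minimal sketch (the column names and sample data are made up for illustration):

```r
# scan() reads mixed-type columns when 'what' is a list template:
# each list element sets the type of the corresponding column.
con <- textConnection("id,name,score
1,alice,3.5
2,bob,4.2")
cols <- scan(con,
             what = list(id = integer(), name = character(), score = numeric()),
             sep = ",", skip = 1, quiet = TRUE)  # skip = 1 skips the header
close(con)
df <- as.data.frame(cols, stringsAsFactors = FALSE)
sapply(df, class)  # id: "integer", name: "character", score: "numeric"
```

This gives an ordinary data frame, though, not a `big.matrix`; the answers below explain why `read.big.matrix()` cannot produce the same result.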


Comments (4)

淡写薰衣草的香 2024-12-05 04:37:16


Try the ff package for this.

library(ff)
help(read.table.ffdf)

Function ‘read.table.ffdf’ reads separated flat files into ‘ffdf’
objects, very much like (and using) ‘read.table’. It can also
work with any convenience wrappers like ‘read.csv’ and provides
its own convenience wrapper (e.g. ‘read.csv.ffdf’) for R's usual
wrappers.

For 200 MB it should be as simple a task as this.

 x <- read.csv.ffdf(file=csvfile)

(For much bigger files it will likely require that you investigate some of the configuration options, depending on your machine and OS).

£冰雨忧蓝° 2024-12-05 04:37:16


Ah, there are some things that are impossible in this life, and there are some that are misunderstood and lead to unpleasant situations. @Roman is right: a matrix must be of one atomic type. It's not a dataframe.

Since a matrix must be of one type, attempting to snooker bigmemory to handle multiple types is, in itself, a bad thing. Could it be done? I'm not going there. Why? Because everything else will assume that it's getting a matrix, not a dataframe. That will lead to more questions and more sorrow.

Now, what you can do is to identify the types of each of the columns, and generate a set of distinct bigmemory files, each containing the items that are of a particular type. E.g. charBM = character big matrix, intBM = integer big matrix, and so on. Then, you may be able to develop a wrapper that produces a data frame out of all of this. Still, I don't recommend that: treat the different items as what they are, or coerce homogeneity if you can, rather than try to produce a big dataframe griffin.

@mdsumner is correct in suggesting ff. Another storage option is HDF5, which you can access through ncdf4 in R. Unfortunately, these other packages are not as pleasant as bigmemory.
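The per-type split described above can be sketched with plain matrices as stand-ins for `big.matrix` objects (the principle, one atomic type per matrix, is the same; `df` and the `intBM`/`chrBM`/`numBM` names are hypothetical):

```r
# One homogeneous matrix per column type, plus a wrapper that rebuilds
# a data frame. Plain matrices stand in for bigmemory's big.matrix here.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)
intBM <- as.matrix(df[vapply(df, is.integer,   logical(1))])  # integer matrix
chrBM <- as.matrix(df[vapply(df, is.character, logical(1))])  # character matrix
numBM <- as.matrix(df[vapply(df, is.double,    logical(1))])  # numeric matrix
# The wrapper: reassemble a data frame from the homogeneous pieces,
# restoring the original column order.
rebuilt <- data.frame(intBM, chrBM, numBM, stringsAsFactors = FALSE)[names(df)]
```

With real `big.matrix` objects the reassembly step would pull each file-backed matrix into RAM, which is exactly the "dataframe griffin" problem the answer warns against.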

浪荡不羁 2024-12-05 04:37:16


According to the help file, no.

Files must contain only one atomic type (all integer, for example).
You, the user, should know whether your file has row and/or column
names, and various combinations of options should be helpful in
obtaining the desired behavior.

I'm not familiar with this package/function, but in R, matrices can have only one atomic type (unlike e.g. data.frames).
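That single-atomic-type constraint is easy to demonstrate in base R (toy data made up for illustration):

```r
# A matrix silently coerces everything to one atomic type...
m <- cbind(1:2, c("a", "b"))
class(m[1, 1])   # "character" -- the integers became strings
# ...while a data.frame keeps a separate type per column.
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
sapply(df, class)  # x: "integer", y: "character"
```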

天赋异禀 2024-12-05 04:37:16


The best solution is to read the file line by line (or in small chunks) and parse it as you go; that way the memory the reading process occupies is bounded by the chunk size rather than growing with the file size.
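A base-R sketch of that chunked approach (the file contents, chunk size, and column classes are made-up placeholders; a real 200 MB file would use `file()` instead of `textConnection()`):

```r
# Chunked CSV reading: peak memory is bounded by chunk_size, not file size.
con <- textConnection("id,name,score\n1,a,1.5\n2,b,2.5\n3,c,3.5")
col_names  <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume the header
chunk_size <- 2
chunks <- list()
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size,
             col.names  = col_names,
             colClasses = c("integer", "character", "numeric")),
    error = function(e) NULL)            # read.csv errors at end of input
  if (is.null(chunk)) break
  chunks[[length(chunks) + 1]] <- chunk  # or process and discard each chunk
  if (nrow(chunk) < chunk_size) break    # short chunk => file exhausted
}
close(con)
result <- do.call(rbind, chunks)         # a mixed-type data frame
```

Accumulating every chunk, as done here, defeats the memory savings; in practice each chunk would be processed and discarded, or written to per-type storage as suggested in the earlier answers.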
