ggplot2 - cannot allocate vector of size 128.0 Mb
I have a file of 4.5MB (9,223,136 lines) with the following information:
0 0
0.0147938 3.67598e-07
0.0226194 7.35196e-07
0.0283794 1.10279e-06
0.033576 1.47039e-06
0.0383903 1.83799e-06
0.0424806 2.20559e-06
0.0465545 2.57319e-06
0.0499759 2.94079e-06
Each column holds a value from 0 to 100, representing a percentage. My goal is to draw a graph in ggplot2 to compare the percentages between the two columns (e.g. at 20% on column 1, what percentage has column 2 reached?). Here is my R script:
library(ggplot2)
dataset=read.table("~/R/datasets/cumul.txt.gz")
p <- ggplot(dataset,aes(V2,V1))
p <- p + geom_line()
p <- p + scale_x_continuous(formatter="percent") + scale_y_continuous(formatter="percent")
p <- p + theme_bw()
ggsave("~/R/grafs/cumul.png")
I'm having a problem: every time I run this, R runs out of memory, giving the error "Cannot allocate vector of size 128.0 Mb". I'm running 32-bit R on a Linux machine and have about 4 GB of free memory.
I thought of a workaround that consists of reducing the precision of these values (by rounding them) and eliminating duplicate lines, so that the dataset has fewer rows. Could you give me some advice on how to do this?
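A minimal sketch of that rounding-and-deduplicating idea (the choice of 3 decimal places is arbitrary; adjust to taste):

```r
dataset <- read.table("~/R/datasets/cumul.txt.gz")

dataset <- round(dataset, 3)   # lower the precision of both columns
dataset <- unique(dataset)     # keep only distinct rows

nrow(dataset)                  # check how much smaller the dataset got
```

Rounding first is what makes `unique()` effective here: nearby points collapse onto the same rounded coordinates and are then dropped as duplicates.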
Are you sure you have 9 million lines in a 4.5MB file (edit: perhaps your file is 4.5 GB??)? It must be heavily compressed -- when I create a file that is one tenth the size, it's 115Mb ...
It's hard to tell from the information you've given what kind of "duplicate lines" you have in your data set ...
unique(dataset)
will extract just the unique rows, but that may not be useful. I would probably start by simply thinning the data set by a factor of 100 or 1000, and see how it goes from there. (edit: forgot a comma!)
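The thinning step can be sketched as a simple row subset; the trailing comma (the one the "edit" above refers to) selects all columns:

```r
# keep every 1000th row of the data frame
thinned <- dataset[seq(1, nrow(dataset), by = 1000), ]

nrow(thinned)  # roughly nrow(dataset) / 1000
```

With ~9 million rows reduced to ~9 thousand, the line plot should render comfortably within 32-bit R's memory limits, and for a cumulative curve this dense the visual difference is negligible.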
Graphical representations of large data sets are often a challenge. In general you will be better off: