R + ggplot2 - Cannot allocate vector of size 128.0 Mb

Posted 2024-11-11 14:02:52

I have a file of 4.5MB (9,223,136 lines) with the following information:

0       0
0.0147938       3.67598e-07
0.0226194       7.35196e-07
0.0283794       1.10279e-06
0.033576        1.47039e-06
0.0383903       1.83799e-06
0.0424806       2.20559e-06
0.0465545       2.57319e-06
0.0499759       2.94079e-06

Each column holds values from 0 to 100, representing percentages. My goal is to draw a plot in ggplot2 to compare the percentages between them (e.g. at 20% of column 1, what percentage is reached in column 2). Here is my R script:

library(ggplot2)
dataset=read.table("~/R/datasets/cumul.txt.gz")
p <- ggplot(dataset,aes(V2,V1))
p <- p + geom_line()
p <- p + scale_x_continuous(formatter="percent") + scale_y_continuous(formatter="percent")
p <- p + theme_bw()
ggsave("~/R/grafs/cumul.png")

Every time I run this, R runs out of memory and gives the error: "Cannot allocate vector of size 128.0 Mb". I'm running 32-bit R on a Linux machine and I have about 4 GB of free memory.

I thought of a workaround: reduce the precision of these values (by rounding them) and eliminate duplicate lines, so that the dataset has fewer lines. Could you give me some advice on how to do this?

做个ˇ局外人 2024-11-18 14:02:52

Are you sure you have 9 million lines in a 4.5MB file (edit: perhaps your file is 4.5 GB??)? It must be heavily compressed -- when I create a file with one tenth as many rows, it's 115Mb ...

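n <- 9e5  # about one tenth of the ~9 million rows in the question
set.seed(1001)
z <- rnorm(n)
z <- cumsum(z)/sum(z)  # cumulative sum, scaled so the last value is 1
d <- data.frame(V1=seq(0,1,length.out=n),V2=z)
ff <- gzfile("lgfile2.gz", "w")  # write through a gzip-compressed connection
write.table(d,row.names=FALSE,col.names=FALSE,file=ff)
close(ff)
file.info("lgfile2.gz")["size"]  # size of the compressed file in bytes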

It's hard to tell from the information you've given what kind of "duplicate lines" you have in your data set ... unique(dataset) will extract just the unique rows, but that may not be useful. I would probably start by simply thinning the data set by a factor of 100 or 1000:

smdata <- dataset[seq(1,nrow(dataset),by=1000),]

and see how it goes from there. (edit: forgot a comma!)
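For comparison, here is a minimal sketch of the rounding-and-deduplicating idea from the question next to the thinning suggested above. The data frame is a synthetic stand-in (an assumption, not the real file), and the choice of one decimal place is arbitrary. On the full data set, `unique()` on a data frame may itself need a fair amount of memory, so thinning is likely the cheaper first step.

```r
# Synthetic stand-in for the real data (assumption): two increasing
# columns scaled to the range 0..100
set.seed(1)
n <- 1e5
dataset <- data.frame(V1 = cumsum(runif(n)), V2 = seq_len(n))
dataset$V1 <- 100 * dataset$V1 / max(dataset$V1)
dataset$V2 <- 100 * dataset$V2 / max(dataset$V2)

# Option 1: round both columns to 1 decimal place, then drop duplicated rows
rounded <- unique(round(dataset, 1))

# Option 2: keep every 1000th row
thinned <- dataset[seq(1, nrow(dataset), by = 1000), ]

# Compare the row counts before and after
c(original = nrow(dataset), rounded = nrow(rounded), thinned = nrow(thinned))
```

Either reduced data frame can then be passed to ggplot in place of `dataset`.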

Graphical representations of large data sets are often a challenge. In general you will be better off:

summarizing the data somehow before plotting it
using a specialized graphical type (density plots, contours, hexagonal binning) that reduces the data
using base graphics, which uses a "draw and forget" model (unless graphics recording is turned on, e.g. in Windows), rather than lattice/ggplot/grid graphics, which save a complete graphical object and then render it
using raster or bitmap graphics (PNG etc.), which only record the state of each pixel in the image, rather than vector graphics, which save all objects whether they overlap or not
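As a rough sketch of the first, third and fourth points combined: summarize the curve on a coarse grid with `approx()`, then draw it with base graphics into a PNG device. The data are again synthetic (an assumption, here scaled 0..1), the 101-point grid is an arbitrary choice, and the output goes to a temporary file rather than the path from the question.

```r
# Synthetic stand-in for the real data (assumption): two increasing columns in 0..1
set.seed(1)
n <- 1e5
dataset <- data.frame(V1 = sort(runif(n)), V2 = seq(0, 1, length.out = n))

# Summarize: interpolate V2 at 101 evenly spaced V1 values (0%, 1%, ..., 100%).
# rule = 2 reuses the nearest end value outside the observed V1 range.
grid <- seq(0, 1, length.out = 101)
smdata <- data.frame(V1 = grid,
                     V2 = approx(dataset$V1, dataset$V2, xout = grid, rule = 2)$y)

# Value of V2 reached at 20% of V1
smdata$V2[smdata$V1 == 0.2]

# Draw with base graphics straight into a PNG (raster) device, if available
out <- file.path(tempdir(), "cumul.png")
if (capabilities("png")) {
  png(out, width = 800, height = 600)
  plot(smdata$V2, smdata$V1, type = "l", xlab = "V2", ylab = "V1")
  dev.off()
}
```

With only 101 rows, the summarized data frame is also small enough to hand to ggplot without memory trouble.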
