optimization memory-management r premature-optimization

Does taking a slice of a table in R use memory?

Posted on 2024-10-22 20:50:32

If I take a slice of a table using, say, the column names, does R allocate memory to hold the slice in a new location? Specifically, I have a table with columns depth1 and depth2, among others. I want to add columns that contain the max and min of the two. I have two approaches:

dd = dat[,c("depth1","depth2")]
dat$mindepth = apply(dd,1,min)
dat$maxdepth = apply(dd,1,max)
remove(dd)

or

dat$mindepth = apply(dat[,c("depth1","depth2")],1,min)
dat$maxdepth = apply(dat[,c("depth1","depth2")],1,max)

If slicing does not use up new memory, I'd rather take the slice only once; otherwise I would like to save the reallocation. Which one is better? Memory issues can be critical when dealing with large datasets, so please don't downvote this with the "root of all evil" meme.

金橙橙 2024-10-29 20:50:32

I know this doesn't actually answer the main thrust of the question (@hadley has done that and deserves credit), but there are other options to those you suggest. Here you could use pmin() and pmax() as another solution, and using with() or within() we can do it without explicit subsetting to create a dd.

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> dat <- within(dat, mindepth <- pmin(depth1, depth2))
R> dat <- within(dat, maxdepth <- pmax(depth1, depth2))
R> 
R> dat
       depth1    depth2   mindepth  maxdepth
1  0.26550866 0.2059746 0.20597457 0.2655087
2  0.37212390 0.1765568 0.17655675 0.3721239
3  0.57285336 0.6870228 0.57285336 0.6870228
4  0.90820779 0.3841037 0.38410372 0.9082078
5  0.20168193 0.7698414 0.20168193 0.7698414
6  0.89838968 0.4976992 0.49769924 0.8983897
7  0.94467527 0.7176185 0.71761851 0.9446753
8  0.66079779 0.9919061 0.66079779 0.9919061
9  0.62911404 0.3800352 0.38003518 0.6291140
10 0.06178627 0.7774452 0.06178627 0.7774452
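
As a quick sanity check, the sketch below compares the vectorised pmin()/pmax() results with the row-wise apply() calls from the question, on the same toy data. The names by_apply_min, by_pmin and so on are only illustrative, and nothing here measures memory use.

```r
set.seed(1)
dat <- data.frame(depth1 = runif(10), depth2 = runif(10))

# Row-wise approach from the question: apply() coerces the slice to a matrix first
by_apply_min <- apply(dat[, c("depth1", "depth2")], 1, min)
by_apply_max <- apply(dat[, c("depth1", "depth2")], 1, max)

# Vectorised approach: pmin()/pmax() work directly on the column vectors
by_pmin <- with(dat, pmin(depth1, depth2))
by_pmax <- with(dat, pmax(depth1, depth2))

# Drop any names apply() might attach before comparing the values
same_min <- identical(unname(by_apply_min), by_pmin)
same_max <- identical(unname(by_apply_max), by_pmax)
```

If both flags are TRUE, the two approaches can be swapped freely as far as the values are concerned.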

We can look at how much copying goes on with tracemem(), but only if your R was compiled with the configure option --enable-memory-profiling.
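
Before relying on tracemem(), it may help to check whether the running R build supports it. This is a minimal sketch using capabilities("profmem"); the object x and the name has_profmem are made up for illustration.

```r
# capabilities("profmem") reports whether this R build has memory profiling
has_profmem <- capabilities("profmem")

if (has_profmem) {
  x <- data.frame(a = 1:3)
  tracemem(x)          # start reporting duplications of x
  x$b <- x$a * 2L      # any copy of x is printed as a tracemem[...] line
  untracemem(x)        # stop tracing
} else {
  message("This R build was compiled without --enable-memory-profiling")
}
```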

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x2641cd8>"
R> dat <- within(dat, mindepth <- pmin(depth1, depth2))
tracemem[0x2641cd8 -> 0x2641a00]: within.data.frame within 
tracemem[0x2641a00 -> 0x2641878]: [<-.data.frame [<- within.data.frame within 
R> tracemem(dat)
[1] "<0x2657bc8>"
R> dat <- within(dat, maxdepth <- pmax(depth1, depth2))
tracemem[0x2657bc8 -> 0x2c765d8]: within.data.frame within 
tracemem[0x2c765d8 -> 0x2c764b8]: [<-.data.frame [<- within.data.frame within

So we see that R copied dat twice during each within() call. Compare that with your two suggestions:

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x2e1ddd0>"
R> dd <- dat[,c("depth1","depth2")]
R> tracemem(dd)
[1] "<0x2df01a0>"
R> dat$mindepth = apply(dd,1,min)
tracemem[0x2df01a0 -> 0x2cf97d8]: as.matrix.data.frame as.matrix apply 
tracemem[0x2e1ddd0 -> 0x2cc0ab0]: 
tracemem[0x2cc0ab0 -> 0x2cc0b20]: $<-.data.frame $<- 
tracemem[0x2cc0b20 -> 0x2cc0bc8]: $<-.data.frame $<- 
R> tracemem(dat)
[1] "<0x26b93c8>"
R> dat$maxdepth = apply(dd,1,max)
tracemem[0x2df01a0 -> 0x2cc0e30]: as.matrix.data.frame as.matrix apply 
tracemem[0x26b93c8 -> 0x26742c8]: 
tracemem[0x26742c8 -> 0x2674358]: $<-.data.frame $<- 
tracemem[0x2674358 -> 0x2674478]: $<-.data.frame $<-

Here, dd is copied once in each call to apply() because apply() converts dd to a matrix before proceeding. The final three lines in each block of tracemem output indicate that three copies of dat are made to insert the new column.
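
One way to sidestep the matrix coercion that apply() performs is to hand the columns straight to pmin()/pmax(). The sketch below does this with do.call() on a list-style column subset; the variable cols is illustrative, and the number of copies this version makes has not been traced here.

```r
set.seed(1)
dat <- data.frame(depth1 = runif(10), depth2 = runif(10))

cols <- c("depth1", "depth2")

# dat[cols] is list-style subsetting; unname() keeps column names from being
# mistaken for named arguments, and do.call() passes each column separately
dat$mindepth <- do.call(pmin, unname(as.list(dat[cols])))
dat$maxdepth <- do.call(pmax, unname(as.list(dat[cols])))
```

This form also extends to more than two depth columns by adding names to cols.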

What about your second option?

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x268bc88>"
R> dat$mindepth <- apply(dat[,c("depth1","depth2")],1,min)
tracemem[0x268bc88 -> 0x26376b0]: 
tracemem[0x26376b0 -> 0x2637720]: $<-.data.frame $<- 
tracemem[0x2637720 -> 0x2637790]: $<-.data.frame $<- 
R> tracemem(dat)
[1] "<0x2466d40>"
R> dat$maxdepth <- apply(dat[,c("depth1","depth2")],1,max)
tracemem[0x2466d40 -> 0x22ae0d8]: 
tracemem[0x22ae0d8 -> 0x22ae1f8]: $<-.data.frame $<- 
tracemem[0x22ae1f8 -> 0x22ae318]: $<-.data.frame $<-

This version avoids the copy involved in setting up dd, but in all other respects it is similar to your previous suggestion.

Can we do any better? Yes. One simple way is to use the within() option I started with, but execute both statements that create the new mindepth and maxdepth variables in a single call to within():

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x21c4158>"
R> dat <- within(dat, { mindepth <- pmin(depth1, depth2)
+                      maxdepth <- pmax(depth1, depth2) })
tracemem[0x21c4158 -> 0x21c44a0]: within.data.frame within 
tracemem[0x21c44a0 -> 0x21c4628]: [<-.data.frame [<- within.data.frame within

In this version we incur only two copies of dat, compared with the four copies of the original within() version.
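
For comparison, base R's transform() can also add both columns in a single call. This is only a sketch of the syntax; dat2 is an illustrative name, and how many copies transform() makes is not measured here, so it would need the same tracemem() check as the versions above.

```r
set.seed(1)
dat <- data.frame(depth1 = runif(10), depth2 = runif(10))

# Both new columns are computed in one call, using the existing columns by name
dat2 <- transform(dat,
                  mindepth = pmin(depth1, depth2),
                  maxdepth = pmax(depth1, depth2))
```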

What if we coerce dat to a matrix and then do the insertions?

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x1f29c70>"
R> mat <- as.matrix.data.frame(dat)
tracemem[0x1f29c70 -> 0x1f09768]: as.matrix.data.frame 
R> tracemem(mat)
[1] "<0x245ff30>"
R> mat <- cbind(mat, pmin(mat[,1], mat[,2]), pmax(mat[,1], mat[,2]))
R>

That is an improvement, as we only incur the cost of a single copy of dat when coercing to a matrix. I cheated a bit by calling the as.matrix.data.frame() method directly. If we'd just used as.matrix(), we'd have incurred another copy of mat.
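
The cbind() call above leaves the two new columns unnamed. A small variation, sketched below, names them and indexes the source columns by name instead of position. It uses the generic as.matrix() for brevity, so per the note above it may cost one more copy than calling the method directly.

```r
set.seed(1)
dat <- data.frame(depth1 = runif(10), depth2 = runif(10))

# Generic call; the answer calls as.matrix.data.frame() directly to save a copy
mat <- as.matrix(dat)

# Named arguments to cbind() become the column names of the new columns
mat <- cbind(mat,
             mindepth = pmin(mat[, "depth1"], mat[, "depth2"]),
             maxdepth = pmax(mat[, "depth1"], mat[, "depth2"]))
```

This only makes sense when every column is numeric: as.matrix() on a data frame that also holds character or factor columns yields a character matrix.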

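This highlights one of the reasons why matrices can be so much faster to work with than data frames: every assignment into a data frame goes through methods such as $<-.data.frame or [<-.data.frame, which, as the traces above show, copy the object more than once.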
