Does taking a slice of a table in R use memory?
If I take a slice of a table using, say, the column names, does R allocate memory to hold the slice in a new location? Specifically, I have a table with columns depth1 and depth2, among others. I want to add columns which contain the max and min of the two. I have two approaches:
dd = dat[,c("depth1","depth2")]
dat$mindepth = apply(dd,1,min)
dat$maxdepth = apply(dd,1,max)
remove(dd)
or
dat$mindepth = apply(dat[,c("depth1","depth2")],1,min)
dat$maxdepth = apply(dat[,c("depth1","depth2")],1,max)
If I am not using up new memory, I'd rather take the slice only once; otherwise I would like to save the reallocation. Which one is better? Memory issues can be critical when dealing with large datasets, so please don't downvote this with the "root of all evil" meme.
1 Answer
I know this doesn't actually answer the main thrust of the question (@hadley has done that and deserves the credit), but there are other options besides the ones you suggest. Here you could use pmin() and pmax() as another solution, and using with() or within() we can do it without explicit subsetting to create dd.
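A minimal sketch of that approach, using a small made-up data frame in place of the poster's dat (only the depth1 and depth2 columns matter here; everything else is illustrative):

# Made-up stand-in for dat; any other columns are irrelevant to the example
dat <- data.frame(site = 1:5,
                  depth1 = c(2, 4, 1, 8, 3),
                  depth2 = c(5, 2, 6, 7, 1))

# pmin()/pmax() are vectorised element-wise min/max, so no row-wise apply()
# is needed, and within() lets us refer to the columns directly
dat <- within(dat, mindepth <- pmin(depth1, depth2))
dat <- within(dat, maxdepth <- pmax(depth1, depth2))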
We can look at how much copying goes on with tracemem(), but only if your R was compiled with the configure option --enable-memory-profiling activated.
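A sketch of how that check might look; the addresses and the number of duplications reported will depend on your R build and version, so the output is not reproduced here:

# tracemem() reports each time R's internal code duplicates the marked object
tracemem(dat)
dat <- within(dat, mindepth <- pmin(depth1, depth2))
dat <- within(dat, maxdepth <- pmax(depth1, depth2))
untracemem(dat)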
So we see that R copied dat twice during each within() call. Compare that with your two suggestions:
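A sketch of tracing your first suggestion (again, the tracemem output itself is not reproduced here):

# First suggestion: build dd explicitly, then apply() over its rows
tracemem(dat)
dd <- dat[, c("depth1", "depth2")]
tracemem(dd)
dat$mindepth <- apply(dd, 1, min)
dat$maxdepth <- apply(dd, 1, max)
remove(dd)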
Here, dd is copied once in each call to apply(), because apply() converts dd to a matrix before proceeding. The final three lines in each block of tracemem output indicate that three copies of dat are being made to insert the new column.

What about your second option?
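And a sketch of tracing the second suggestion, where the subset is taken inline in each apply() call:

# Second suggestion: no intermediate dd, the slice is taken twice
tracemem(dat)
dat$mindepth <- apply(dat[, c("depth1", "depth2")], 1, min)
dat$maxdepth <- apply(dat[, c("depth1", "depth2")], 1, max)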
Here, this version avoids the copy involved in setting up dd, but in all other respects it is similar to your previous suggestion.

Can we do any better? Yes, and one simple way is to use the within() option I started with, but execute both statements to create the new mindepth and maxdepth variables in the one call to within():
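For instance, something along these lines (a sketch, not the answer's original code):

# Both new columns created in a single within() call
dat <- within(dat, {
    mindepth <- pmin(depth1, depth2)
    maxdepth <- pmax(depth1, depth2)
})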
In this version we only invoke two copies of dat, compared to the four copies of the original within() version.

What about if we coerce dat to a matrix and then do the insertions?
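One way this might look (a sketch; cbind() is assumed here for adding the new columns, and the columns of dat are assumed to be numeric so the coercion yields a numeric matrix):

# Assuming dat again holds just the original columns at this point.
# Calling the data.frame method directly skips the dispatch through as.matrix()
mat <- as.matrix.data.frame(dat)
tracemem(mat)
mat <- cbind(mat,
             mindepth = pmin(mat[, "depth1"], mat[, "depth2"]),
             maxdepth = pmax(mat[, "depth1"], mat[, "depth2"]))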
That is an improvement, as we only incur the cost of the single copy of dat when coercing to a matrix. I cheated a bit by calling the as.matrix.data.frame() method directly; if we'd just used as.matrix() we'd have incurred another copy of mat.

This highlights one of the reasons why matrices are so much faster to use than data frames.