Writing to large matrices in a function - fast versus slow

Posted on 2024-12-17 00:46:42

[Question amended following responses]

Thanks for the responses. I was unclear in my question, for which I apologise.

I'll try to give more details of our situation. We have c. 100 matrices that we keep in an environment. Each is very large. If at all possible we want to avoid any copying of these matrices when we perform updates. We're often running up against the 2GB memory limit, so this is very important for us.

So our two requirements are 1) avoiding copies and 2) addressing the matrices indirectly by name. Speed, whilst important, is a side-issue that would be solved by avoiding the copying.

It appears to me that Tommy's solution involved creating a copy (though it did entirely answer my actual original question, so I'm the one at fault).

The code below is what seems most obvious to us, but it clearly creates a copy (as shown by the increase in memory.size):

myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)

testfnDirect <- function(paramEnv) {
    print(memory.size())

    for (i in 1:300) {
        temp <- paramEnv$testmat1[10,] 
        paramEnv$testmat1[10,] <- temp * 0
    }   
    print(memory.size())
}
system.time(testfnDirect(myenv))

Using the with function seems to avoid this, as shown below:

myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)

testfnDirect <- function(paramEnv) {
    print(gc())
    varname <- "testmat1" # unused, but see text
    with (paramEnv, {
        for (i in 1:300) {
            temp <- testmat1[10,] 
            testmat1[10,] <- temp * 0
        }
    })
    print(gc())
}
system.time(testfnDirect(myenv))

However, that code works by addressing testmat1 directly by name. Our problem is that we need to address it indirectly (we don't know in advance which matrices we'll be updating).

Is there a way of amending testfnDirect so that we use the variable varname rather than hardcoding testmat1?
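
One possible shape for such an amendment is sketched below. It does not come from any of the replies: testfnIndirect and the placeholder symbol MAT are hypothetical names. The idea is to build the same expression that the with version evaluates, but with the matrix name filled in from varname. Whether it avoids the copy on a given R version would still need checking with gc(), as in the code above.

```r
# Hypothetical sketch: same loop as the with() version, but the matrix
# name is supplied as a string instead of being hardcoded.
testfnIndirect <- function(paramEnv, varname) {
    # substitute() swaps the placeholder symbol MAT for the symbol named by varname
    expr <- substitute({
        for (i in 1:300) {
            temp <- MAT[10, ]
            MAT[10, ] <- temp * 0
        }
    }, list(MAT = as.name(varname)))
    # Evaluate inside the environment, as with() does. Note that i and temp
    # are created in paramEnv too, just as in the with() version.
    eval(expr, envir = paramEnv)
    invisible(paramEnv)
}

myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow = 6000, ncol = 200)
testfnIndirect(myenv, "testmat1")
```

Because the evaluated expression is identical to the hardcoded one once the name has been substituted, any difference in memory behaviour between the two would be surprising, but that is an expectation rather than a measured result.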

Comments (2)

酒解孤独 2024-12-24 00:46:42

Well, it would be nice if you could explain why the first solution isn't OK... It looks much neater AND runs faster.

To try to answer the questions:

  1. A "nested replacement" operation like foo[bar][baz] <- 42 is very complex, and is optimized for certain cases to avoid copying. But it is very likely that your particular use case is not optimized. That would lead to lots of copies, and loss of performance.

    A way to test that theory is to call gcinfo(TRUE) before your tests. You'll then see that the first solution triggers 2 garbage collects, and the second one triggers around 160!

  2. Here's a variant of your second solution that converts the environment to a list, does its thing and then converts back to an environment. It is as fast as your first solution.

Code:

testfnList <- function() {
    mylist <- as.list(myenv, all.names=TRUE)

    thisvar <- "testmat2"
    for (i in 1:300) {
        temp <- mylist[[thisvar]][10,]
        mylist[[thisvar]][10,] <- temp * 0
    }
    
    myenv <<- as.environment(mylist)
}
system.time(testfnList()) # 0.02 secs

...it would of course be neater if you passed myenv to the function as an argument.
A small improvement (if you loop a lot, not just 300 times) would be to index by number instead of by name (this doesn't work for environments, but it does for lists). Just change thisvar:

thisvar <- match("testmat2", names(mylist))
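
As a rough sketch of the "pass myenv as an argument" suggestion: testfnListArg and its row and times parameters are hypothetical names, and using list2env to write the results back is an assumption that is not part of the code above. It is used so that the updated matrices land in the caller's environment rather than in a newly created one.

```r
# Hypothetical variant of testfnList: the environment and the matrix name
# are arguments instead of globals.
testfnListArg <- function(env, varname, row = 10, times = 300) {
    mylist <- as.list(env, all.names = TRUE)
    for (i in seq_len(times)) {
        temp <- mylist[[varname]][row, ]
        mylist[[varname]][row, ] <- temp * 0
    }
    # Write the updated objects back into the same environment
    list2env(mylist, envir = env)
    invisible(env)
}

myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow = 6000, ncol = 200)
myenv$testmat2 <- matrix(1.0, nrow = 6000, ncol = 200)
testfnListArg(myenv, "testmat2")
```

This keeps the list round trip from the version above, so it says nothing new about copying; it only removes the dependence on the global myenv and on <<-.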

半世晨晓 2024-12-24 00:46:42

A fairly recent change to the 'data.table' package was specifically to avoid copying when modifying values. So if your application can handle data.tables for the other operations, that could be a solution. (And it would be fast.)
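
A minimal sketch of what that could look like, assuming the data.table package is installed and that holding a matrix as a data.table of 200 numeric columns is acceptable for the rest of the application. set() is data.table's function for assignment by reference; the object name dt is illustrative.

```r
library(data.table)

# Illustrative: hold the 6000 x 200 matrix as a data.table with 200 numeric columns
dt <- as.data.table(matrix(1.0, nrow = 6000, ncol = 200))

# Zero out row 10, one column at a time; set() assigns by reference
for (j in seq_along(dt)) {
    set(dt, i = 10L, j = j, value = 0)
}
```

Note that this changes the storage layout from one matrix to a list of columns, so any matrix algebra elsewhere in the code would need converting back, which may itself copy.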
