How to remove training data from party:::ctree models?
I created several ctree models (about 40 to 80) which I want to evaluate rather often.
An issue is that the model objects are very big (40 models require more than 2.8G of memory), and it appears to me that they store the training data, maybe as modelname@data and modelname@responses, and not just the information relevant for predicting new data.
Most other R learning packages have configurable options whether to include the data in the model object, but I couldn't find any hints in the documentation. I also tried to assign empty ModelEnv objects by
modelname@data <- new("ModelEnv")
but there was no effect on the size of the respective RData file.
Does anyone know whether ctree really stores the training data, and how to remove from ctree models all data that is irrelevant for new predictions, so that I can fit many of them in memory?
Thanks a lot,
Stefan
Thank you for your feedback, that was already very helpful.
I used dput and str to take a deeper look at the object and found that no training data is included in the model, but there is a responses slot, which seems to hold the training labels and rownames. Anyway, I noticed that each node stores a weights vector with one entry per training sample. After a while of inspecting the code, I ended up googling a bit and found the following comment in the party NEWS log:
CHANGES IN party VERSION 0.9-13 (2007-07-23)
o update `mvt.f'
o improve the memory footprint of RandomForest objects
substancially (by removing the weights slots from each node).
It turns out there is a C function in the party package to remove these weights, called R_remove_weights, with the following definition:
SEXP R_remove_weights(SEXP subtree, SEXP removestats) {
    C_remove_weights(subtree, LOGICAL(removestats)[0]);
    return(R_NilValue);
}
It also works fine:
# cc is my model object
sum(unlist(lapply(slotNames(cc), function (x) object.size(slot(cc, x)))))
# returns: [1] 2521256
save(cc, file="cc_before.RData")
.Call("R_remove_weights", cc@tree, TRUE, PACKAGE="party")
# returns NULL and removes weights and node statistics
sum(unlist(lapply(slotNames(cc), function (x) object.size(slot(cc, x)))))
# returns: [1] 1521392
save(cc, file="cc_after.RData")
As you can see, it reduces the object size substantially, from roughly 2.5MB to 1.5MB.
What is strange, though, is that the corresponding RData files are insanely huge, and the call had no impact on them:
$ ls -lh cc*
-rw-r--r-- 1 user user 9.6M Aug 24 15:44 cc_after.RData
-rw-r--r-- 1 user user 9.6M Aug 24 15:43 cc_before.RData
Unzipping the file shows the 2.5MB object to occupy nearly 100MB of space:
$ cp cc_before.RData cc_before.gz
$ gunzip cc_before.gz
$ ls -lh cc_before*
-rw-r--r-- 1 user user 98M Aug 24 15:45 cc_before
Any ideas, what could cause this?
I found a solution to the problem at hand, so I'm writing this answer in case anyone runs into the same issue. I'll describe my process, so it might be a bit rambling; bear with me.
With no clue, I thought about nuking slots and removing weights to get the objects as small as possible and at least save some memory, in case no fix could be found. So I removed @data and @responses as a start, and prediction still went fine without them, yet there was no effect on the .RData file size. I then went the other way round and created an empty ctree model, just plugging the tree into it:
Checking the size of the original object:
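Sketched with a small stand-in model (a ctree fitted on iris; any fitted ctree will do in place of my cc), this step looks like:

```r
library(party)  # assumes the party package is installed

# stand-in for the model object from the question; any fitted ctree works
cc <- ctree(Species ~ ., data = iris)
object.size(cc)  # size of the full model object in bytes
```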
Now, let's create an empty CTree and copy the tree only:
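Roughly, with the same iris stand-in (new("BinaryTree") gives the empty object):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)  # stand-in model, as above

# empty BinaryTree, with only the fitted tree structure copied over
tt <- new("BinaryTree")
tt@tree <- cc@tree
```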
This new tree object is now much smaller:
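With the stand-in model, the comparison looks like this (the absolute numbers will of course differ from my real model):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)
tt <- new("BinaryTree")
tt@tree <- cc@tree

# the stripped copy lacks @data, @responses and the method slots
object.size(tt) < object.size(cc)  # TRUE
```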
However, it can't be used to predict:
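For example (the exact error message depends on the party version):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)
tt <- new("BinaryTree")
tt@tree <- cc@tree

# the method slots were not copied, so prediction fails
res <- try(predict(tt, newdata = iris), silent = TRUE)
inherits(res, "try-error")  # TRUE
```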
We did not set @cond_distr_response, which might cause the error, so copy the original one as well and try to predict again:
This works perfectly, but now the size of the RData file is back at its original value:
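In sketch form (file name illustrative, iris stand-in as before):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)  # stand-in model
tt <- new("BinaryTree")
tt@tree <- cc@tree

# copying the response method makes prediction work again ...
tt@cond_distr_response <- cc@cond_distr_response
p <- predict(tt, newdata = iris)

# ... but the copied closure drags its whole enclosing
# environment into the saved file
save(tt, file = "tt.RData")
file.size("tt.RData")
```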
Simply printing the slot shows it to be a function bound to an environment:
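For example:

```r
library(party)

cc <- ctree(Species ~ ., data = iris)

cc@cond_distr_response               # prints the function body ...
environment(cc@cond_distr_response)  # ... and the environment it is bound to
```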
So the answer to the initial question appears to be that the methods of the object bind an environment to it, which is then saved with the object in the corresponding RData file. This might also explain why several packages are loaded when the RData file is read.
Thus, to get rid of the environment, we can't copy the methods, but we can't predict without them either. The rather "dirty" solution is to emulate the functionality of the original methods and call the underlying C code directly. After some digging through the source code, this is indeed possible. As the code copied above suggests, we need to call get_where, which determines the terminal node of the tree reached by the input. We then need to call R_getpredictions to determine the response from that terminal node for each input sample. The tricky part is that we need to get the data in the right input format, and thus have to call the data preprocessing included in ctree.
We now only need to save the extracted tree and the formula string to be able to predict new data:
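A rough sketch of the idea follows. The internal names here are assumptions from reading the sources of one party version (party:::ctreedpp for the preprocessing, "R_get_nodeID" as the C entry point behind get_where) and must be checked against your installed version before use:

```r
library(party)

# sketch, not verbatim: predict new data from a bare tree + formula string.
# Assumed internals (verify against your party version's sources):
#  - party:::ctreedpp() reproduces ctree's input preprocessing
#  - "R_get_nodeID" is the C routine behind the get_where method
#  - "R_getpredictions" maps terminal-node IDs to responses
predict_tree <- function(tree, formulastr, newdata) {
    pp    <- party:::ctreedpp(as.formula(formulastr), data = newdata)
    inp   <- pp@get("input")
    where <- .Call("R_get_nodeID", tree, inp, 0, PACKAGE = "party")
    .Call("R_getpredictions", tree, where, PACKAGE = "party")
}

# only the bare tree and the formula string need to be kept around
cc         <- ctree(Species ~ ., data = iris)
tree       <- cc@tree
formulastr <- "Species ~ ."          # illustrative formula
save(tree, formulastr, file = "cc_small.RData")
```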
We can further remove the unnecessary weights as described in the updated question above:
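Applied to the extracted tree, the call from the question update:

```r
library(party)

cc   <- ctree(Species ~ ., data = iris)
tree <- cc@tree  # the extracted tree

# same C helper as found in the NEWS log above; TRUE also
# drops the per-node statistics
.Call("R_remove_weights", tree, TRUE, PACKAGE = "party")
```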
Now let's have a look at the file sizes again:
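A sketch of the comparison (file names illustrative; the 9.6M and 43K figures above are from my real model, not from this toy example):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)

save(cc, file = "cc_full.RData")        # the full model object
tree       <- cc@tree
formulastr <- "Species ~ ."             # formula kept as a plain string
save(tree, formulastr, file = "cc_small.RData")

file.size("cc_full.RData")              # compare the two on disk
file.size("cc_small.RData")
```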
Finally, instead of (compressed) 9.6M, only 43K are required to use the model. I should now be able to fit as many as I want in my 3G heap space. Hooray!
What you're looking for is to remove slots. A word of caution: this could be rather dangerous given how party functions work with the object.
Nonetheless, take a look at slotNames(yourModel). You can also try object.size(slot(yourModel, slotNameOfInterest)) to examine the size of the different slots. You could easily create a sorted table to be sure of the sizes of the objects in each slot.
In any case, the slot for data is a ModelEnvFormula (I'll call this "MEF") object. You could create a dummy MEF, dummyMEF <- ModelEnvFormula(1 ~ 1), and then assign that to data: slot(yourModel, "data") <- dummyMEF. That will nuke that particular slot.
You should take a look to see if there are other slots that are causing headaches in terms of storage - the object.size() function will assist. I agree that it's nice to be able to omit training data from the model object.