How to remove training data from party:::ctree models?
I created several ctree models (about 40 to 80) which I want to evaluate rather often.
An issue is that the model objects are very big (40 models require more than 2.8G of memory), and it appears to me that they store the training data, maybe as modelname@data and modelname@responses, and not just the information relevant for predicting new data.
Most other R learning packages have configurable options whether to include the data in the model object, but I couldn't find any hints in the documentation. I also tried to assign empty ModelEnv objects by
modelname@data <- new("ModelEnv")
but there was no effect on the size of the respective RData file.
Does anyone know whether ctree really stores the training data, and how to remove from ctree models all data that is irrelevant for new predictions, so that I can fit many of them in memory?
Thanks a lot,
Stefan
Thank you for your feedback, that was already very helpful.
I used dput and str to take a deeper look at the object and found that no training data is included in the model, but there is a responses slot, which seems to hold the training labels and rownames. Anyway, I noticed that each node stores a weights vector with one entry per training sample. After a while of inspecting the code, I ended up googling a bit and found the following comment in the party NEWS log:
CHANGES IN party VERSION 0.9-13 (2007-07-23)
o update `mvt.f'
o improve the memory footprint of RandomForest objects
substancially (by removing the weights slots from each node).
It turns out there is a C function in the party package to remove these weights, called R_remove_weights, with the following definition:
SEXP R_remove_weights(SEXP subtree, SEXP removestats) {
    C_remove_weights(subtree, LOGICAL(removestats)[0]);
    return(R_NilValue);
}
It also works fine:
# cc is my model object
sum(unlist(lapply(slotNames(cc), function (x) object.size(slot(cc, x)))))
# returns: [1] 2521256
save(cc, file="cc_before.RData")
.Call("R_remove_weights", cc@tree, TRUE, PACKAGE="party")
# returns NULL and removes weights and node statistics
sum(unlist(lapply(slotNames(cc), function (x) object.size(slot(cc, x)))))
# returns: [1] 1521392
save(cc, file="cc_after.RData")
As you can see, it reduces the object size substantially, from roughly 2.5MB to 1.5MB.
What is strange, though, is that the corresponding RData files are insanely huge, and the call had no impact on them:
$ ls -lh cc*
-rw-r--r-- 1 user user 9.6M Aug 24 15:44 cc_after.RData
-rw-r--r-- 1 user user 9.6M Aug 24 15:43 cc_before.RData
Unzipping the file shows the 2.5MB object to occupy nearly 100MB of space:
$ cp cc_before.RData cc_before.gz
$ gunzip cc_before.gz
$ ls -lh cc_before*
-rw-r--r-- 1 user user 98M Aug 24 15:45 cc_before
Any ideas, what could cause this?
I found a solution to the problem at hand, so I'm writing this answer in case anyone runs into the same issue. I'll describe my process, so it might be a bit rambling; bear with me.
With no clue, I thought about nuking slots and removing weights to get the objects as small as possible and at least save some memory, in case no fix could be found. So I removed @data and @responses as a start, and prediction still went fine without them, yet there was no effect on the .RData file size. I then went the other way round and created an empty ctree model, just plugging the tree into it:
Checking the size of the original object:
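Sketched with a small stand-in model (a ctree fitted on iris; any fitted ctree will do in place of my cc), this step looks like:

```r
library(party)  # assumes the party package is installed

# stand-in for the model object from the question; any fitted ctree works
cc <- ctree(Species ~ ., data = iris)
object.size(cc)  # size of the full model object in bytes
```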
Now, let's create an empty CTree and copy the tree only:
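Roughly, with the same iris stand-in (new("BinaryTree") gives the empty object):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)  # stand-in model, as above

# empty BinaryTree, with only the fitted tree structure copied over
tt <- new("BinaryTree")
tt@tree <- cc@tree
```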
This new tree object is now much smaller:
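With the stand-in model, the comparison looks like this (the absolute numbers will of course differ from my real model):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)
tt <- new("BinaryTree")
tt@tree <- cc@tree

# the stripped copy lacks @data, @responses and the method slots
object.size(tt) < object.size(cc)  # TRUE
```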
However, it can't be used to predict:
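For example (the exact error message depends on the party version):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)
tt <- new("BinaryTree")
tt@tree <- cc@tree

# the method slots were not copied, so prediction fails
res <- try(predict(tt, newdata = iris), silent = TRUE)
inherits(res, "try-error")  # TRUE
```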
We did not set @cond_distr_response, which might cause the error, so copy the original one as well and try to predict again:
This works perfectly, but now the size of the RData file is back at its original value:
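In sketch form (file name illustrative, iris stand-in as before):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)  # stand-in model
tt <- new("BinaryTree")
tt@tree <- cc@tree

# copying the response method makes prediction work again ...
tt@cond_distr_response <- cc@cond_distr_response
p <- predict(tt, newdata = iris)

# ... but the copied closure drags its whole enclosing
# environment into the saved file
save(tt, file = "tt.RData")
file.size("tt.RData")
```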
Simply printing the slot shows it to be a function bound to an environment:
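For example:

```r
library(party)

cc <- ctree(Species ~ ., data = iris)

cc@cond_distr_response               # prints the function body ...
environment(cc@cond_distr_response)  # ... and the environment it is bound to
```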
So the answer to the initial question appears to be that the methods of the object bind an environment to it, which is then saved with the object in the corresponding RData file. This might also explain why several packages are loaded when the RData file is read.
Thus, to get rid of the environment, we can't copy the methods, but we can't predict without them either. The rather "dirty" solution is to emulate the functionality of the original methods and call the underlying C code directly. After some digging through the source code, this is indeed possible. As the code copied above suggests, we need to call get_where, which determines the terminal node of the tree reached by the input. We then need to call R_getpredictions to determine the response from that terminal node for each input sample. The tricky part is that we need to get the data in the right input format, and thus have to call the data preprocessing included in ctree.
We now only need to save the extracted tree and the formula string to be able to predict new data:
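A rough sketch of the idea follows. The internal names here are assumptions from reading the sources of one party version (party:::ctreedpp for the preprocessing, "R_get_nodeID" as the C entry point behind get_where) and must be checked against your installed version before use:

```r
library(party)

# sketch, not verbatim: predict new data from a bare tree + formula string.
# Assumed internals (verify against your party version's sources):
#  - party:::ctreedpp() reproduces ctree's input preprocessing
#  - "R_get_nodeID" is the C routine behind the get_where method
#  - "R_getpredictions" maps terminal-node IDs to responses
predict_tree <- function(tree, formulastr, newdata) {
    pp    <- party:::ctreedpp(as.formula(formulastr), data = newdata)
    inp   <- pp@get("input")
    where <- .Call("R_get_nodeID", tree, inp, 0, PACKAGE = "party")
    .Call("R_getpredictions", tree, where, PACKAGE = "party")
}

# only the bare tree and the formula string need to be kept around
cc         <- ctree(Species ~ ., data = iris)
tree       <- cc@tree
formulastr <- "Species ~ ."          # illustrative formula
save(tree, formulastr, file = "cc_small.RData")
```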
We can further remove the unnecessary weights as described in the updated question above:
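Applied to the extracted tree, the call from the question update:

```r
library(party)

cc   <- ctree(Species ~ ., data = iris)
tree <- cc@tree  # the extracted tree

# same C helper as found in the NEWS log above; TRUE also
# drops the per-node statistics
.Call("R_remove_weights", tree, TRUE, PACKAGE = "party")
```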
Now let's have a look at the file sizes again:
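A sketch of the comparison (file names illustrative; the 9.6M and 43K figures above are from my real model, not from this toy example):

```r
library(party)

cc <- ctree(Species ~ ., data = iris)

save(cc, file = "cc_full.RData")        # the full model object
tree       <- cc@tree
formulastr <- "Species ~ ."             # formula kept as a plain string
save(tree, formulastr, file = "cc_small.RData")

file.size("cc_full.RData")              # compare the two on disk
file.size("cc_small.RData")
```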
Finally, instead of (compressed) 9.6M, only 43K are required to use the model. I should now be able to fit as many as I want in my 3G heap space. Hooray!
What you're looking for is to remove slots. A word of caution: this could be rather dangerous given how party functions work with the object.
Nonetheless, take a look at slotNames(yourModel). You can also try object.size(slot(yourModel, slotNameOfInterest)) to examine the size of the different slots. You could easily create a sorted table to be sure of the sizes of the objects in each slot.
In any case, the slot for data is a ModelEnvFormula (I'll call this "MEF") object. You could create a dummy MEF, dummyMEF <- ModelEnvFormula(1 ~ 1), and then assign that to data: slot(yourModel, "data") <- dummyMEF. That will nuke that particular slot.
You should take a look to see if there are other slots that are causing headaches in terms of storage - the object.size() function will assist. I agree that it's nice to be able to omit training data from the model object.