Is there an equivalent of "anova" (for lm) for an rpart object?

Posted on 2024-08-24 17:55:11 | 666 words | 5 views | 0 comments

When using R's rpart function, I can easily fit a model with it. For example:

# Classification Tree with rpart
library(rpart)

# grow tree 
fit <- rpart(Kyphosis ~ Age + Number + Start,
     method="class", data=kyphosis)

printcp(fit) # display the results 
plotcp(fit) 
summary(fit) # detailed summary of splits

# plot tree 
plot(fit, uniform=TRUE, 
     main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

My question is -
How can I measure the "importance" of each of my three explanatory variables (Age, Number, Start) to the model?

If this were a regression model, I could look at the p-values from the "anova" F-test (comparing lm models with and without the variable). But what is the equivalent, for an rpart object, of using "anova" on lm?
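
For reference, the lm comparison described above can be sketched as follows. The built-in mtcars data and the choice of wt and hp as predictors are illustrative assumptions, not part of the original question.

```r
# Nested linear models on the built-in mtcars data (illustrative choice).
reduced <- lm(mpg ~ wt, data = mtcars)       # model without hp
full    <- lm(mpg ~ wt + hp, data = mtcars)  # model with hp

# F-test for whether adding hp significantly improves the fit.
cmp <- anova(reduced, full)
print(cmp)

# The p-value for the added variable sits in the second row.
p_value <- cmp[["Pr(>F)"]][2]
print(p_value)
```

A small p-value suggests the extra variable explains variation that the reduced model does not.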

(I hope I managed to make my question clear)

Thanks.


Comments (1)

下壹個目標 2024-08-31 17:55:11

Of course, anova is not possible here, because anova involves calculating the total variation in the response variable and partitioning it into informative components (SSA, SSE). I can't see how one could calculate a sum of squares for a categorical variable like Kyphosis.

I think that what you are actually talking about is attribute selection (or evaluation). I would use the information gain measure, for example. I think this is what is used to select the test attribute at each node in the tree: the attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions.
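
To make the idea concrete, here is a minimal sketch of information gain computed by hand in R. The helper names entropy and info_gain are hypothetical (not from any package), the code assumes no missing values, and the cutoff of 9 on Start is an arbitrary choice used only for illustration.

```r
library(rpart)  # only needed here for the kyphosis data

# Entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]  # drop empty classes so 0 * log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain from partitioning labels y by the groups in x:
# entropy before the split minus the weighted entropy after it.
info_gain <- function(y, x) {
  x <- factor(x)                        # drops unused levels
  w <- prop.table(table(x))             # share of samples in each group
  cond <- sapply(split(y, x), entropy)  # entropy within each group
  entropy(y) - sum(w[names(cond)] * cond)
}

# Illustrative binary split; the cutoff 9 is an arbitrary assumption.
print(info_gain(kyphosis$Kyphosis, kyphosis$Start >= 9))
```

A gain of 0 means the split tells you nothing about the class; the maximum possible gain equals the entropy of the class labels themselves.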

I am not aware of a method for ranking attributes by their information gain in R, but I know there is one in WEKA, named InfoGainAttributeEval. It evaluates the worth of an attribute by measuring its information gain with respect to the class. If you use Ranker as the search method, the attributes are ranked by their individual evaluations.

EDIT
I finally found a way to do this in R, using the CORElearn library:

library(CORElearn)
library(rpart)  # provides the kyphosis data
estInfGain <- attrEval(Kyphosis ~ ., kyphosis, estimator="InfGain")
print(estInfGain)
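
As an alternative that was not part of the answer above, rpart itself stores an importance score in the fitted object's variable.importance component. The sketch below assumes a version of rpart recent enough to provide that component. It also passes parms = list(split = "information") so that splits are chosen by information gain; rpart's default for classification trees is the Gini index.

```r
library(rpart)

# Same model as in the question, but asking rpart to split on information
# gain instead of its default Gini index.
fit_info <- rpart(Kyphosis ~ Age + Number + Start,
                  method = "class", data = kyphosis,
                  parms = list(split = "information"))

# Named numeric vector; present only when the tree has at least one split.
imp <- fit_info$variable.importance
print(imp)

# Rescale to percentages for easier comparison between variables.
print(round(100 * imp / sum(imp), 1))
```

These scores are descriptive rather than a significance test, so they rank the variables but do not give p-values the way anova does for lm.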