R Random Forest Variable Importance
I am trying to use the random forests package for classification in R.
The Variable Importance Measures listed are:
- mean raw importance score of variable x for class 0
- mean raw importance score of variable x for class 1
- MeanDecreaseAccuracy
- MeanDecreaseGini
Now I know what these "mean", in the sense that I know their definitions. What I want to know is how to use them.
What I really want to know is what these values mean purely in terms of how accurate they are: what is a good value, what is a bad value, what the maximums and minimums are, etc.
If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini, does that mean it is important or unimportant? Any information on raw scores could be useful too.
I want to know everything there is to know about these numbers that is relevant to applying them.
An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.
It's like asking someone to explain how to use a radio: I wouldn't expect the explanation to involve how a radio converts radio waves into sound.
Comments (3)
How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly, the parameters and related performance issues of Random Forests are difficult to get your head around even if you understand some technical terms.
Here's my shot at some answers:
Mean raw importance score of variable x for class 0 / class 1: Simplifying from the Random Forest web page, the raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.
MeanDecreaseAccuracy: I think this is only in the R module, and I believe it measures how much including this predictor in the model reduces classification error.
MeanDecreaseGini: Gini is defined as "inequity" when describing a society's distribution of income, or as a measure of "node impurity" in tree-based classification. A low Gini (i.e. a higher decrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on the values of predictors. I'm not so clear on how this translates into better performance.
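To make "node impurity" and "decrease in Gini" a bit more concrete, here is a minimal sketch in plain Python. It is not the randomForest implementation; the function names `gini_impurity` and `gini_decrease` and the toy labels are made up for illustration. It only shows that a split which separates the classes cleanly produces a large decrease, while a split that leaves the classes mixed produces none.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 0.0 means the node is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Hypothetical node holding four rows of class 0 and four rows of class 1.
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split on a useful predictor separates the classes completely.
good_split = gini_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A split on a useless predictor leaves both children as mixed as the parent.
poor_split = gini_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(good_split, poor_split)  # prints: 0.5 0.0
```

Roughly speaking, a variable whose splits look like `good_split` across many trees ends up with a high MeanDecreaseGini, which is why higher is read as more important.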
For your immediate concern: higher values mean the variables are more important. This should be true for all the measures you mention.
Random forests give you pretty complex models, so it can be tricky to interpret the importance measures. If you want to easily understand what your variables are doing, don't use RFs. Use linear models or a (non-ensemble) decision tree instead.
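You said: "An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work."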
It's going to be awfully tough to explain much more than the above unless you dig in and learn about random forests. I assume you're complaining about either the manual, or the section from Breiman's manual:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
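To figure out how important a variable is, they fill it with random junk ("permute" it), then see how much predictive accuracy decreases. MeanDecreaseAccuracy works this way. MeanDecreaseGini does not involve permutation: it adds up how much the splits on that variable reduce node impurity, averaged over all the trees. I'm not sure what the raw importance scores are.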
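The "fill it with random junk and see how much accuracy drops" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a hand-written rule (`predict`) stands in for the fitted forest, the data are synthetic, and the helper name `permutation_importance` is made up here. The real package works tree by tree on held-out rows rather than on one fixed model.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(labels)

def permutation_importance(predict, rows, labels, column, rng, repeats=20):
    """Average drop in accuracy after shuffling the values of one column."""
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats

rng = random.Random(0)

# Synthetic data: column 0 decides the class, column 1 is pure noise.
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]

def predict(row):
    """Stand-in for a fitted model (an assumption, not a real forest)."""
    return 1 if row[0] > 0.5 else 0

signal = permutation_importance(predict, rows, labels, 0, rng)
noise = permutation_importance(predict, rows, labels, 1, rng)
print(signal, noise)
```

Scrambling the column the model relies on costs a lot of accuracy, so it gets a large score. Scrambling the noise column costs nothing, so its score is zero. That is the sense in which a higher value means a more important variable.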
Interpretability is kinda tough with Random Forests. While RF is an extremely robust classifier, it makes its predictions democratically. By this I mean you build hundreds or thousands of trees, each time taking a random subset of your variables and a random subset of your data and building a tree. Then you make a prediction for all the non-selected data and save the prediction. It's robust because it deals well with the vagaries of your data set (i.e. it smooths over randomly high/low values, fortuitous plots/samples, measuring the same thing 4 different ways, etc.). However, if you have some highly correlated variables, both may seem important, as they are not both always included in each model.
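The "random subset of your data" and "non-selected data" parts can be sketched as follows. This is a rough illustration, not the package's code: it assumes each tree draws its rows with replacement (a bootstrap sample), and the helper name `bootstrap_split` is hypothetical. It leaves out the tree building and the random subset of variables.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw n_rows row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(42)
n_rows = 1000

# One draw per tree: the tree is grown on in_bag rows and then asked to
# predict the out_of_bag rows, which it has never seen.
in_bag, out_of_bag = bootstrap_split(n_rows, rng)
oob_fraction = len(out_of_bag) / n_rows
print(len(in_bag), len(out_of_bag), oob_fraction)
```

With sampling of this kind, roughly a third of the rows are left out of any given tree, so every tree has its own small test set for the saved predictions.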
One potential approach with random forests may be to use them to help whittle down your predictors, then switch to regular CART or try the party package for inference-based tree models. However, you must then be wary of data mining issues and of making inferences about parameters.