Accessing individual leaves in a randomForest
I'm using the R package quantregForest, which is based on randomForest, to generate forecast intervals from a set of predictors.
After training the algorithm on some data, it outputs a quantile-based prediction interval for each set of predictors in the test data. As I understand it, each leaf (or terminal node) in the generated random forest represents a distribution of values. How can I access the values that make up each of the leaves (terminal nodes) in the forest?
Comments (1)
I understand that you're using the R-based quantregForest package at the moment. I'm not well versed in this package, but I'll answer your question using the quantile-forest package, which is a comparable Python-based implementation of Quantile Regression Forests. You may be able to produce your desired outcome in Python; if not, the concepts discussed here may translate to the quantregForest implementation. A quantile regression forest must store the training response values (or a mapping thereof) in the leaf nodes, so it should be conceptually possible to retrieve the values in any canonical implementation. I'll speculate on how this might be accomplished with the quantregForest package at the end of my answer.
Extracting Leaf Values from quantile-forest Implementation
As of v1 of the quantile-forest package, the training sample response (y) values are stored in a model.forest_.y_train list object, and a mapping of training sample indices to leaf nodes is stored in a model.forest_.y_train_leaves object, which is a 3-dimensional array of shape (n_estimators, max_n_leaves, max_n_leaf_samples). The training mapping uses 1-indexed values (as opposed to Python's usual 0-indexing) so that the object can be stored as a sparse array (with 0 representing unused elements, rather than the first training sample). Altogether, then, to retrieve the values that make up a leaf, one needs to access a leaf index in the mapping object, take the nonzero entries stored there, subtract 1 from each, and use the resulting non-negative values as indices into the stored response values.
Code Examples Using quantile-forest Implementation
Here's an example that puts these details together in order to access the values in a particular leaf:
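As a minimal sketch of that lookup, assuming the layout described above: the snippet below uses plain Python lists as toy stand-ins for the two stored objects, so y_train and y_train_leaves here are made-up data that only mirror the described shape, and leaf_values is a hypothetical helper name rather than part of the package.

```python
# Toy stand-ins that mirror the layout described above (not a fitted model).
# y_train: training responses, one per training sample.
y_train = [10.0, 12.5, 9.0, 14.0, 11.0]

# y_train_leaves[tree][node] holds 1-indexed training sample indices,
# padded with 0 for unused slots. Shape: (2 trees, 3 nodes, 3 slots).
y_train_leaves = [
    [[0, 0, 0], [1, 3, 0], [2, 4, 5]],  # tree 0: node 0 is non-terminal
    [[0, 0, 0], [1, 2, 3], [4, 5, 0]],  # tree 1: node 0 is non-terminal
]


def leaf_values(y_train, y_train_leaves, tree_idx, node_idx):
    """Return the training responses stored in one node of one tree."""
    slots = y_train_leaves[tree_idx][node_idx]
    # Drop the 0 padding, then shift from 1-indexed to 0-indexed.
    sample_idx = [i - 1 for i in slots if i > 0]
    return [y_train[i] for i in sample_idx]


values = leaf_values(y_train, y_train_leaves, tree_idx=0, node_idx=1)
print(values)  # [10.0, 9.0]
```

With a fitted model, the same indexing would presumably be applied to model.forest_.y_train and model.forest_.y_train_leaves in place of the toy lists.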
The above example shows how to access an individual leaf node. You could loop over each node for each tree in order to access the leaf values across the full ensemble. So continuing from the above example:
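A rough sketch of such a loop, again on toy data rather than a fitted model (the nested lists only imitate the described 1-indexed, 0-padded layout, and all_node_values and leaves_only are hypothetical names):

```python
# Toy stand-ins that mirror the described layout (not a fitted model).
y_train = [10.0, 12.5, 9.0, 14.0, 11.0]
y_train_leaves = [
    [[0, 0, 0], [1, 3, 0], [2, 4, 5]],  # tree 0
    [[0, 0, 0], [1, 2, 3], [4, 5, 0]],  # tree 1
]

# Map (tree index, node index) -> list of training responses in that node.
all_node_values = {}
for tree_idx, tree_nodes in enumerate(y_train_leaves):
    for node_idx, slots in enumerate(tree_nodes):
        # Skip the 0 padding and convert 1-indexed entries to 0-indexed.
        all_node_values[(tree_idx, node_idx)] = [
            y_train[i - 1] for i in slots if i > 0
        ]

# Keep only the nodes that actually hold training samples.
leaves_only = {key: vals for key, vals in all_node_values.items() if vals}

for key, vals in all_node_values.items():
    print(key, vals)
```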
Note that the above looping will include non-terminal nodes; leaf nodes will be those nodes with non-empty lists.
That said, depending on your desired goal here, there may be further convenience functions that can help. For example, if you want to find which samples share leaf nodes (known as proximities), the package has a proximity_counts function that can do this. Here's an example of using that function to get the values of every training sample that shares a leaf with the first test sample:
This function could be used, for example, to get the response values that are used to calculate the quantile(s) for particular samples or to count the number of times that pairs of samples reside in the same leaf node.
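To make the idea of proximity counts concrete, here is a small sketch that computes them by hand from leaf assignments. It does not call the package's proximity_counts function; train_leaves, test_leaves, and proximity_counts_toy are hypothetical names, and the leaf assignments are made-up toy data.

```python
from collections import Counter

# Toy training responses (not from a fitted model).
y_train = [10.0, 12.5, 9.0, 14.0, 11.0]

# train_leaves[tree][i]: id of the leaf that training sample i falls into.
train_leaves = [
    [1, 2, 1, 2, 2],  # tree 0
    [1, 1, 1, 2, 2],  # tree 1
]
# test_leaves[tree]: id of the leaf that the single test sample falls into.
test_leaves = [1, 1]


def proximity_counts_toy(train_leaves, test_leaves):
    """Count, per training sample, the trees where it shares a leaf with the test sample."""
    counts = Counter()
    for tree_assignments, test_leaf in zip(train_leaves, test_leaves):
        for sample_idx, leaf in enumerate(tree_assignments):
            if leaf == test_leaf:
                counts[sample_idx] += 1
    return counts


counts = proximity_counts_toy(train_leaves, test_leaves)
# Responses of every training sample that shares at least one leaf
# with the test sample, ordered by training sample index.
shared_values = [y_train[i] for i in sorted(counts)]
print(dict(counts))
print(shared_values)
```

The counts say how many trees place a training sample in the same leaf as the test sample, which is the quantity the proximity discussion above refers to.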
Applying the Above Concepts to quantregForest Implementation
I'm not intimately familiar with the quantregForest package, but a brief look at the code suggests similarities to the above. The counterpart to the y_train_leaves object appears to be valuesNodes. However, it's worth noting that it appears to store the response values directly (rather than a mapping to a separate list of values) and to store only 1 value per leaf node. Given these caveats, though, you should be able to use this object to retrieve the values that make up each of the leaf nodes.