Accessing individual leaves in a randomForest

Posted 2025-01-14 21:30:13

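I am using the R package quantregForest, which is based on randomForest, to generate prediction intervals from a set of predictor variables.

After the algorithm is trained on some data, it outputs quantile-based prediction intervals for each set of predictors in the test data. As I understand it, each leaf (or terminal node) in the resulting random forest represents a distribution of values. How can I access the values that make up each leaf (terminal node) in the forest?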

谁人与我共长歌 2025-01-21 21:30:13

I understand that you're currently using the R-based quantregForest package. I'm not well-versed in that package, so I'll answer your question with the quantile-forest package, a comparable Python-based implementation of Quantile Regression Forests. You may be able to produce your desired outcome in Python; if not, the concepts discussed here may translate to the quantregForest implementation. A quantile regression forest must store the training response values (or a mapping to them) in its leaf nodes, so it should be conceptually possible to retrieve those values in any canonical implementation. At the end of my answer I'll speculate on how this might be accomplished with the quantregForest package.

Extracting Leaf Values from quantile-forest Implementation

As of v1 of the quantile-forest package, the training response (y) values are stored in the list object model.forest_.y_train, and the mapping from leaf nodes to training sample indices is stored in model.forest_.y_train_leaves, a 3-dimensional array of shape (n_estimators, max_n_leaves, max_n_leaf_samples). The mapping uses 1-based indices (rather than Python's usual 0-based indexing) so that the object can be stored as a sparse array, with 0 marking an unused element rather than the first training sample. To retrieve the values that make up a leaf, then, look up the leaf in the mapping object, subtract 1 from each stored index, and use the resulting non-negative values as indices into the stored response values.
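To make the indexing scheme concrete, here is a toy sketch that uses plain Python lists rather than the package itself. The data are invented and the helper name leaf_values is hypothetical; only the 1-based, zero-padded layout follows the description above.

```python
# Toy stand-ins (invented data, not produced by quantile-forest).
y_train = [10.0, 20.0, 30.0, 40.0, 50.0]

# Shape: (n_estimators=1, max_n_leaves=3, max_n_leaf_samples=3).
# Entries are 1-based training indices; 0 marks an unused slot.
y_train_leaves = [
    [
        [0, 0, 0],  # node with no stored samples
        [1, 3, 0],  # holds training samples 0 and 2
        [2, 4, 5],  # holds training samples 1, 3 and 4
    ]
]


def leaf_values(tree, node):
    """Return the response values stored for one node of one tree.

    Hypothetical helper for illustration; not part of quantile-forest.
    """
    stored = y_train_leaves[tree][node]
    # Subtract 1 to get 0-based indices and skip the zero padding.
    indices = [i - 1 for i in stored if i > 0]
    return [y_train[i] for i in indices]


print(leaf_values(0, 1))
print(leaf_values(0, 2))
```

Using 0 as the padding value is what forces the offset: without it, an unused slot would be indistinguishable from a reference to the first training sample.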

Code Examples Using quantile-forest Implementation

Here's an example that puts these details together in order to access the values in a particular leaf:

import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

qrf = RandomForestQuantileRegressor(random_state=0)
qrf.fit(X_train, y_train)

# Get the training indices for tree=0, leaf=18683.
y_train_leaves = np.asarray(qrf.forest_.y_train_leaves)
train_indices = y_train_leaves[0, 18683, :] - 1
train_indices = train_indices[train_indices >= 0]

# Get the training response values for the training indices
print(np.array(qrf.forest_.y_train)[train_indices])

The above example shows how to access an individual leaf node. You could loop over each node for each tree in order to access the leaf values across the full ensemble. So continuing from the above example:

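# Convert the stored responses once, outside the loop.
y_train_arr = np.array(qrf.forest_.y_train)

n_trees, n_nodes, _ = y_train_leaves.shape
for tree_i in range(n_trees):
    for node_j in range(n_nodes):
        # Undo the 1-based offset and drop the padding entries.
        train_indices_ij = y_train_leaves[tree_i, node_j] - 1
        train_indices_ij = train_indices_ij[train_indices_ij >= 0]
        print(y_train_arr[train_indices_ij])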

Note that the loop above also visits non-terminal nodes; the leaf nodes are the ones that yield a non-empty list of values.
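If only the non-empty nodes are of interest, one option is to collect them into a dictionary keyed by (tree, node). The following is a minimal sketch on invented toy data; collect_leaf_values is a hypothetical helper, not a quantile-forest function, and it assumes the same 1-based, zero-padded mapping described earlier.

```python
def collect_leaf_values(y_train, y_train_leaves):
    """Map (tree index, node index) to that node's stored response values.

    Hypothetical helper: nodes whose slots are all 0 are skipped, so only
    nodes that actually hold training samples appear in the result.
    """
    leaves = {}
    for tree_i, tree in enumerate(y_train_leaves):
        for node_j, stored in enumerate(tree):
            # Entries are 1-based; 0 marks an unused slot.
            values = [y_train[int(i) - 1] for i in stored if i > 0]
            if values:
                leaves[(tree_i, node_j)] = values
    return leaves


# Toy data: 2 trees, 3 nodes per tree, up to 3 samples per node.
toy_y_train = [1.5, 2.5, 3.5, 4.5]
toy_y_train_leaves = [
    [[0, 0, 0], [1, 2, 0], [3, 4, 0]],
    [[0, 0, 0], [4, 0, 0], [1, 2, 3]],
]

toy_leaves = collect_leaf_values(toy_y_train, toy_y_train_leaves)
for key, values in sorted(toy_leaves.items()):
    print(key, values)
```

Because the helper only iterates and indexes, it should also accept NumPy arrays in place of the nested lists, though that has not been checked against a fitted model here.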

That said, depending on your desired goal here, there may be further convenience functions that can help. For example, if you want to find which samples share leaf nodes (known as proximities), the package has a proximity_counts function that can do this. Here's an example of using that function to get the values of every training sample that shares a leaf with the first test sample:

proximities = qrf.proximity_counts(X_test)
prox_indices = np.array([x[0] for x in proximities[0]])
print(np.array(qrf.forest_.y_train)[prox_indices])

This function could be used, for example, to get the response values that are used to calculate the quantile(s) for particular samples or to count the number of times that pairs of samples reside in the same leaf node.
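To illustrate what counting shared leaves means, here is a toy sketch that does not use the package. It assumes we already know which leaf each training sample and one test sample fall into in every tree; toy_proximity_counts and its inputs are hypothetical, and the (index, count) output shape is only inferred from how the example above indexes x[0].

```python
from collections import Counter


def toy_proximity_counts(train_leaves, test_leaves):
    """Count how often each training sample shares a leaf with a test sample.

    train_leaves[t][i] is the leaf id of training sample i in tree t;
    test_leaves[t] is the leaf id of the test sample in tree t.
    Hypothetical illustration, not the quantile-forest implementation.
    """
    counts = Counter()
    for tree_train, test_leaf in zip(train_leaves, test_leaves):
        for i, leaf in enumerate(tree_train):
            if leaf == test_leaf:
                counts[i] += 1
    # Highest counts first; ties are broken by training index.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented leaf assignments: 3 trees, 4 training samples.
toy_train_leaves = [
    [1, 1, 2, 2],
    [3, 4, 4, 3],
    [5, 5, 5, 6],
]
toy_test_leaves = [1, 4, 5]

toy_prox = toy_proximity_counts(toy_train_leaves, toy_test_leaves)
print(toy_prox)
```

Training samples that never share a leaf with the test sample are simply absent from the result, which is why the first element of each pair can be used directly as an index into the stored response values.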

Applying the Above Concepts to quantregForest Implementation

I'm not intimately familiar with the quantregForest package, but a brief look at its code suggests similarities to the above. The counterpart of the y_train_leaves object appears to be valuesNodes. Note, however, that it appears to store the response values directly (rather than a mapping into a separate list of values) and appears to store only 1 value per leaf node. With those caveats in mind, you should be able to use this object to retrieve the values that make up each of the leaf nodes.
