Decision tree split implementation

Posted 2025-01-30 14:21:48

I am doing this as part of my university assignment, but I can't find any resources online on how to implement this correctly.
I have read tons of material on the metrics that define an optimal split (entropy, Gini and others), so I understand how we would choose the optimal value of a feature to split the learning set into left and right nodes.
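
For reference, the kind of criterion I mean is the size-weighted impurity of the two children, e.g. weighted Gini. This is only a minimal sketch of what I assume `calc_g` computes; the helper names `gini` and `weighted_gini` are illustrative, not from my actual code:

    import numpy as np

    def gini(y):
        """Gini impurity of a label vector y: 1 - sum_k p_k^2."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def weighted_gini(y_parent, y_left, y_right):
        """Split criterion: size-weighted average of the children's impurities."""
        n = len(y_parent)
        return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

A lower value means a purer split, so the search simply minimises this quantity over all candidate (feature, threshold) pairs.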

However, what I totally don't get is the complexity of the implementation. Since we also have to choose the optimal feature, computing the optimal split at each node takes about O(n^2) per feature with my approach (roughly n candidate thresholds, each evaluated in O(n)). That is bad, considering real ML datasets are shaped around 10^2 x 10^6: at the root node alone that would be on the order of 10^2 * (10^6)^2 = 10^14 operations, which is really big in terms of computational cost.

Am I missing some kind of approach that could be used here to help reduce complexity?

I currently have this baseline implementation for choosing the best feature and value to split on, but I really want to make it better:

    # Track the best split found so far; a lower criterion value means a better split.
    best_G = float("inf")
    feature_idx, threshold = None, None

    for f_idx in range(X_subset.shape[1]):
        sorted_values = X_subset.iloc[:, f_idx].sort_values()
        # Candidate thresholds: skip the extremes that would leave fewer than
        # min_samples_split samples on one side of the split.
        lo = self.min_samples_split - 1
        hi = len(sorted_values) - self.min_samples_split + 1
        for v in sorted_values.iloc[lo:hi]:
            y_left, y_right = self.make_split_only_y(f_idx, v, X_subset, y_subset)
            G = calc_g(y_subset, y_left, y_right)
            if G < best_G:
                best_G = G
                feature_idx = f_idx
                threshold = v

    return feature_idx, threshold

Comments (1)

甜尕妞 2025-02-06 14:21:48

So, since no one answered, here is some stuff I found out.

Firstly, yes, this task is very computationally intensive. However, several tricks can be used to reduce the number of splits you need to evaluate to "grow a tree".

This is especially important since you don't really want a giant overfitted tree - it just doesn't have any value on its own; what is more important is to get a weak model which can be used with others in some sort of ensembling technique.

As for the regularization tricks, here are a couple I have used myself (a sketch of how they typically fit in follows this list):

  • limit the maximum depth of the tree
  • limit the minimum number of samples in a node
  • limit the maximum number of leaves in the tree
  • require a minimum improvement of the split criterion before accepting an optimal split
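
As a rough sketch of how these limits plug into tree growing (the attribute names `max_depth`, `min_samples_split`, `max_leaf_nodes` and `min_impurity_decrease` are illustrative, not from my actual class), they usually become early-exit checks around the split search:

    def should_stop(self, y, depth, n_leaves):
        """Checks applied before searching for a split; True means 'make a leaf'."""
        if depth >= self.max_depth:              # depth limit reached
            return True
        if len(y) < self.min_samples_split:      # too few samples to split this node
            return True
        if n_leaves >= self.max_leaf_nodes:      # leaf budget already used up
            return True
        return False

    # ...and after the best split has been found:
    # if impurity(y) - best_weighted_impurity < self.min_impurity_decrease:
    #     turn the node into a leaf instead of splitting it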

For the algorithmic part, there is a way to build the tree in a smarter way. If you do it as in the code I posted earlier, the time complexity will be around O(h * N^2 * D), where h is the height of the tree, N the number of samples and D the number of features. To work around this, there are several approaches which I didn't personally code, but read about (a rough sketch of the first one follows the list):

  • use dynamic programming to accumulate statistics per feature (e.g. running class counts over the sorted values), so you don't have to recalculate them for every candidate split
  • use data binning and bucket sort for O(n) sorting
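
Here is how I understand the first idea for classification with Gini, sketched for a single feature (treat it as an illustration, not a tested implementation): sort the feature once, then sweep the thresholds while maintaining running class counts, so each candidate split is evaluated in O(1) and one feature costs O(N log N) instead of O(N^2):

    import numpy as np

    def best_split_for_feature(x, y, n_classes, min_samples_split=1):
        """One feature, one pass: x is a 1-D float array, y holds integer
        class labels in 0..n_classes - 1."""
        order = np.argsort(x)
        x_sorted, y_sorted = x[order], y[order]
        n = len(y_sorted)

        left = np.zeros(n_classes)                        # class counts left of the cut
        right = np.bincount(y_sorted, minlength=n_classes).astype(float)

        best_g, best_threshold = np.inf, None
        for i in range(n - 1):                            # cut between positions i and i + 1
            left[y_sorted[i]] += 1                        # move one sample from right to left
            right[y_sorted[i]] -= 1
            if i + 1 < min_samples_split or n - i - 1 < min_samples_split:
                continue
            if x_sorted[i] == x_sorted[i + 1]:            # cannot cut between equal values
                continue
            n_l, n_r = i + 1, n - i - 1
            gini_l = 1.0 - np.sum((left / n_l) ** 2)
            gini_r = 1.0 - np.sum((right / n_r) ** 2)
            g = (n_l * gini_l + n_r * gini_r) / n         # weighted impurity of the split
            if g < best_g:
                best_g, best_threshold = g, (x_sorted[i] + x_sorted[i + 1]) / 2
        return best_g, best_threshold

The second idea attacks the number of candidate thresholds instead: quantise each feature into a small number of buckets and only treat the bucket edges as thresholds, which also turns the per-feature sort into a counting/bucket pass.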

Source of info: https://ml-handbook.ru/chapters/decision_tree/intro
(use Google Translate, since the website is in Russian)
