Decision tree split implementation

Posted 2025-01-30 14:21:48

I am doing this as part of my university assignment, but I can't find any resources online on how to implement this correctly.
I have read tons of material on the metrics that define an optimal split (entropy, Gini and others), so I understand how we would choose the optimal value of a feature to split the learning set into left and right nodes.
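
For reference, the kind of criterion I mean is the size-weighted impurity of the two children, e.g. weighted Gini. This is only a minimal sketch of what I assume `calc_g` computes; the helper names `gini` and `weighted_gini` are illustrative, not from my actual code:

    import numpy as np

    def gini(y):
        """Gini impurity of a label vector y: 1 - sum_k p_k^2."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def weighted_gini(y_parent, y_left, y_right):
        """Split criterion: size-weighted average of the children's impurities."""
        n = len(y_parent)
        return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

A lower value means a purer split, so the search simply minimises this quantity over all candidate (feature, threshold) pairs.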

However, what I totally don't get is the complexity of the implementation. Since we also have to choose the optimal feature, computing the optimal split at each node takes about O(n^2) per feature with my approach (roughly n candidate thresholds, each evaluated in O(n)). That is bad, considering real ML datasets are shaped around 10^2 x 10^6: at the root node alone that would be on the order of 10^2 * (10^6)^2 = 10^14 operations, which is really big in terms of computational cost.

Am I missing some kind of approach that could be used here to help reduce complexity?

I currently have this baseline implementation for choosing the best feature and value to split on, but I really want to make it better:

    # Track the best split found so far; a lower criterion value means a better split.
    best_G = float("inf")
    feature_idx, threshold = None, None

    for f_idx in range(X_subset.shape[1]):
        sorted_values = X_subset.iloc[:, f_idx].sort_values()
        # Candidate thresholds: skip the extremes that would leave fewer than
        # min_samples_split samples on one side of the split.
        lo = self.min_samples_split - 1
        hi = len(sorted_values) - self.min_samples_split + 1
        for v in sorted_values.iloc[lo:hi]:
            y_left, y_right = self.make_split_only_y(f_idx, v, X_subset, y_subset)
            G = calc_g(y_subset, y_left, y_right)
            if G < best_G:
                best_G = G
                feature_idx = f_idx
                threshold = v

    return feature_idx, threshold

Comments (1)

甜尕妞 2025-02-06 14:21:48

So, since no one answered, here is some stuff I found out.

Firstly, yes, this task is very computationally intensive. However, several tricks can be used to reduce the number of splits you need to evaluate to "grow a tree".

This is especially important since you don't really want a giant overfitted tree - it just doesn't have any value on its own; what is more important is to get a weak model which can be used with others in some sort of ensembling technique.

As for the regularization tricks, here are a couple I have used myself (a sketch of how they typically fit in follows this list):

  • limit the maximum depth of the tree
  • limit the minimum number of samples in a node
  • limit the maximum number of leaves in the tree
  • require a minimum improvement of the split criterion before accepting an optimal split
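
As a rough sketch of how these limits plug into tree growing (the attribute names `max_depth`, `min_samples_split`, `max_leaf_nodes` and `min_impurity_decrease` are illustrative, not from my actual class), they usually become early-exit checks around the split search:

    def should_stop(self, y, depth, n_leaves):
        """Checks applied before searching for a split; True means 'make a leaf'."""
        if depth >= self.max_depth:              # depth limit reached
            return True
        if len(y) < self.min_samples_split:      # too few samples to split this node
            return True
        if n_leaves >= self.max_leaf_nodes:      # leaf budget already used up
            return True
        return False

    # ...and after the best split has been found:
    # if impurity(y) - best_weighted_impurity < self.min_impurity_decrease:
    #     turn the node into a leaf instead of splitting it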

For the algorithmic part, there is a way to build the tree in a smarter way. If you do it as in the code I posted earlier, the time complexity will be around O(h * N^2 * D), where h is the height of the tree, N the number of samples and D the number of features. To work around this, there are several approaches which I didn't personally code, but read about (a rough sketch of the first one follows the list):

  • use dynamic programming to accumulate statistics per feature (e.g. running class counts over the sorted values), so you don't have to recalculate them for every candidate split
  • use data binning and bucket sort for O(n) sorting
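
Here is how I understand the first idea for classification with Gini, sketched for a single feature (treat it as an illustration, not a tested implementation): sort the feature once, then sweep the thresholds while maintaining running class counts, so each candidate split is evaluated in O(1) and one feature costs O(N log N) instead of O(N^2):

    import numpy as np

    def best_split_for_feature(x, y, n_classes, min_samples_split=1):
        """One feature, one pass: x is a 1-D float array, y holds integer
        class labels in 0..n_classes - 1."""
        order = np.argsort(x)
        x_sorted, y_sorted = x[order], y[order]
        n = len(y_sorted)

        left = np.zeros(n_classes)                        # class counts left of the cut
        right = np.bincount(y_sorted, minlength=n_classes).astype(float)

        best_g, best_threshold = np.inf, None
        for i in range(n - 1):                            # cut between positions i and i + 1
            left[y_sorted[i]] += 1                        # move one sample from right to left
            right[y_sorted[i]] -= 1
            if i + 1 < min_samples_split or n - i - 1 < min_samples_split:
                continue
            if x_sorted[i] == x_sorted[i + 1]:            # cannot cut between equal values
                continue
            n_l, n_r = i + 1, n - i - 1
            gini_l = 1.0 - np.sum((left / n_l) ** 2)
            gini_r = 1.0 - np.sum((right / n_r) ** 2)
            g = (n_l * gini_l + n_r * gini_r) / n         # weighted impurity of the split
            if g < best_g:
                best_g, best_threshold = g, (x_sorted[i] + x_sorted[i + 1]) / 2
        return best_g, best_threshold

The second idea attacks the number of candidate thresholds instead: quantise each feature into a small number of buckets and only treat the bucket edges as thresholds, which also turns the per-feature sort into a counting/bucket pass.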

Source of info: https://ml-handbook.ru/chapters/decision_tree/intro
(use Google Translate, since the website is in Russian)
