How to do an efficient Random Forest with continuous data?
I am trying to learn about Random Forests by creating one from scratch in Python.
To do so, I create multiple Decision Trees, each taking a pandas DataFrame as a parameter, and use Gini impurity as the metric to split my nodes.
I am working with continuous data here, so I am running into a computation problem:
The dataset contains fictitious student grades, with each grade ranging over a different interval ([-2;2], [0;100], ...). There are 1600 rows in this dataset and 13 grade columns.
The problem is that for every column I check every possible split value, so for the first node I check 1599 candidate values (each value being the mean of two neighbouring data points) for each of the 13 columns. I do this to find the best column with the best possible split value.
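For concreteness, the candidate thresholds are the midpoints between consecutive sorted unique values; a minimal sketch of that candidate generation (toy data of my own):

```python
import numpy as np

grades = np.array([1.0, 3.0, 3.0, 7.0])      # one column of grades
uniq = np.unique(grades)                     # sorted unique values: [1. 3. 7.]
candidates = (uniq[:-1] + uniq[1:]) / 2.0    # midpoints between neighbours
print(candidates)                            # [2. 5.]
```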
This part takes over 3 seconds for a single split, so with a tree having multiple levels, and a Random Forest needing multiple trees, it would take hours.
Using scikit-learn, I tried running RandomForestClassifier, and it finished in a matter of milliseconds, so I know it is possible to do much better.
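For reference, the scikit-learn comparison looks roughly like this (the data here is randomly generated stand-in data with the same shape, not my actual dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(1600, 13))  # 1600 rows, 13 grade columns
y = rng.integers(0, 2, size=1600)        # binary class labels

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)                            # finishes almost instantly
```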
What am I missing? Any pointers on where my implementation goes wrong would be appreciated.
Below is my algorithm to find the best column/value pair to split on:
import pandas as pd

def gini_df(df: pd.DataFrame, class_col: str) -> float:
    """
    Function returning the Gini index of a sub-DataFrame, where a hypothesis was applied
    :param df: the sub-DataFrame
    :param class_col: the column representing the class
    :return: the Gini index, a float in [0.0; 1.0]
    """
    occurrences = df.groupby([class_col]).size()
    proportions = occurrences / df.shape[0]
    squared = proportions ** 2
    res = 1 - squared.sum()
    return res
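A quick sanity check of the formula (restated inline so the snippet is self-contained): a pure node should score 0.0 and a 50/50 binary split should score 0.5.

```python
import pandas as pd

def gini(df, class_col):
    # 1 - sum(p_i^2) over the class proportions, same formula as gini_df above
    p = df.groupby([class_col]).size() / len(df)
    return 1 - (p ** 2).sum()

pure = pd.DataFrame({"cls": ["a", "a", "a"]})
mixed = pd.DataFrame({"cls": ["a", "a", "b", "b"]})
print(gini(pure, "cls"))    # 0.0 for a pure node
print(gini(mixed, "cls"))   # 0.5 for a 50/50 split
```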
def test_num_hypothesis(df: pd.DataFrame, col: str, value: float, res_col: str) -> float:
    """
    Tests a numerical hypothesis for a numerical variable, returning the total Gini impurity
    :param df: the DataFrame to test the hypothesis on
    :param col: the column of the hypothesis
    :param value: the value of the hypothesis
    :param res_col: the column of the result classes
    :return: the total weighted Gini impurity
    """
    left = df[df[col] < value]
    right = df[df[col] >= value]
    left_gini = gini_df(left, res_col)
    right_gini = gini_df(right, res_col)
    # TODO factorise the total gini impurity into one separate function
    total_impurity = len(left) / len(df) * left_gini + len(right) / len(df) * right_gini
    return total_impurity
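A small worked example of the weighted impurity computed above (numbers of my own, restating the formula inline):

```python
# split of 10 rows: left has 4 rows (all one class), right has 6 rows (3/3 mix)
left_gini = 1 - (4 / 4) ** 2                    # 0.0, pure node
right_gini = 1 - ((3 / 6) ** 2 + (3 / 6) ** 2)  # 0.5, maximally mixed
total = 4 / 10 * left_gini + 6 / 10 * right_gini
print(total)                                    # 0.3
```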
def test_hypothesis(df: pd.DataFrame, col: str, res_col: str) -> tuple[float, float]:
    """
    Tests every hypothesis for a numerical variable, returning the best one for this column
    :param df: the DataFrame to test the hypotheses on
    :param col: the column of the hypothesis
    :param res_col: the column of the result classes
    :return: a tuple containing the best hypothesis for this column (impurity, hypothesis)
    """
    values = df.sort_values(col)[col].unique()
    best: tuple = None  # (impurity, hypothesis)
    for lo, hi in zip(values[:-1], values[1:]):  # iterate over neighbouring pairs to average them
        hypothesis = (lo + hi) / 2.0
        h_impurity = test_num_hypothesis(df, col, hypothesis, res_col)
        if best is None or h_impurity < best[0]:
            best = (h_impurity, hypothesis)
    return best
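For context on the performance gap: each call to test_num_hypothesis re-filters and re-groups the whole DataFrame, so one split costs roughly O(n) pandas work per candidate, i.e. O(n²) per column. A common alternative (a sketch of my own, not taken from scikit-learn's source) sorts the column once and sweeps cumulative class counts, evaluating every candidate threshold in a single pass:

```python
import numpy as np

def best_split(x, y):
    """Find the threshold t on x minimising the weighted Gini impurity of (x < t, x >= t)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    classes, y_idx = np.unique(y, return_inverse=True)
    n = len(x)
    # one-hot class matrix, then prefix sums: left[i] = class counts of x[:i+1]
    onehot = np.zeros((n, len(classes)))
    onehot[np.arange(n), y_idx] = 1
    left = np.cumsum(onehot, axis=0)
    total = left[-1]
    best = (np.inf, None)  # stays (inf, None) if all values of x are equal
    for i in range(n - 1):
        if x[i] == x[i + 1]:
            continue  # no threshold between equal values
        nl, nr = i + 1, n - i - 1
        pl = left[i] / nl
        pr = (total - left[i]) / nr
        gini = nl / n * (1 - (pl ** 2).sum()) + nr / n * (1 - (pr ** 2).sum())
        if gini < best[0]:
            best = (gini, (x[i] + x[i + 1]) / 2.0)
    return best
```

This brings one column's split search down to a sort plus a single O(n·k) sweep (k = number of classes). On top of that, scikit-learn implements the sweep in compiled Cython and only considers a random subset of features per node (max_features), which is why it finishes in milliseconds.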