How to do an efficient Random Forest with continuous data?
I am trying to learn about Random Forests by creating one from scratch in Python.
To do so, I create multiple Decision Trees, each taking a pandas DataFrame as a parameter, and use Gini impurity as the metric to split my nodes.
I am working with continuous data here, so I am running into a computation problem:
The dataset contains fictitious student grades, with each grade ranging over a different interval ([-2;2], [0;100], ...). There are 1600 rows in this dataset and 13 grade columns.
The problem is that for every column I check every possible split value, so for the first node I check 1599 candidate values (each value being the mean of two neighbouring data points) for each of the 13 columns. I do this to find the best column with the best possible split value.
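For concreteness, the candidate thresholds are the midpoints between consecutive sorted unique values; a minimal sketch of that candidate generation (toy data of my own):

```python
import numpy as np

grades = np.array([1.0, 3.0, 3.0, 7.0])      # one column of grades
uniq = np.unique(grades)                     # sorted unique values: [1. 3. 7.]
candidates = (uniq[:-1] + uniq[1:]) / 2.0    # midpoints between neighbours
print(candidates)                            # [2. 5.]
```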
This part takes over 3 seconds for a single split, so with a tree having multiple levels, and a Random Forest needing multiple trees, it would take hours.
Using scikit-learn, I tried running RandomForestClassifier, and it finished in a matter of milliseconds, so I know it is possible to do much better.
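For reference, the scikit-learn comparison looks roughly like this (the data here is randomly generated stand-in data with the same shape, not my actual dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(1600, 13))  # 1600 rows, 13 grade columns
y = rng.integers(0, 2, size=1600)        # binary class labels

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)                            # finishes almost instantly
```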
What am I missing? Any pointers on where my implementation goes wrong would be appreciated.
Below is my algorithm to find the best column/value pair to split on:
import pandas as pd

def gini_df(df: pd.DataFrame, class_col: str) -> float:
    """
    Function returning the Gini index of a sub-DataFrame, where a hypothesis was applied
    :param df: the sub-DataFrame
    :param class_col: the column representing the class
    :return: the Gini index, a float in [0.0; 1.0]
    """
    occurrences = df.groupby([class_col]).size()
    proportions = occurrences / df.shape[0]
    squared = proportions ** 2
    res = 1 - squared.sum()
    return res
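A quick sanity check of the formula (restated inline so the snippet is self-contained): a pure node should score 0.0 and a 50/50 binary split should score 0.5.

```python
import pandas as pd

def gini(df, class_col):
    # 1 - sum(p_i^2) over the class proportions, same formula as gini_df above
    p = df.groupby([class_col]).size() / len(df)
    return 1 - (p ** 2).sum()

pure = pd.DataFrame({"cls": ["a", "a", "a"]})
mixed = pd.DataFrame({"cls": ["a", "a", "b", "b"]})
print(gini(pure, "cls"))    # 0.0 for a pure node
print(gini(mixed, "cls"))   # 0.5 for a 50/50 split
```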
def test_num_hypothesis(df: pd.DataFrame, col: str, value: float, res_col: str) -> float:
    """
    Tests a numerical hypothesis for a numerical variable, returning the total Gini impurity
    :param df: the DataFrame to test the hypothesis on
    :param col: the column of the hypothesis
    :param value: the value of the hypothesis
    :param res_col: the column of the result classes
    :return: the total weighted Gini impurity
    """
    left = df[df[col] < value]
    right = df[df[col] >= value]
    left_gini = gini_df(left, res_col)
    right_gini = gini_df(right, res_col)
    # TODO factorise the total gini impurity into one separate function
    total_impurity = len(left) / len(df) * left_gini + len(right) / len(df) * right_gini
    return total_impurity
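A small worked example of the weighted impurity computed above (numbers of my own, restating the formula inline):

```python
# split of 10 rows: left has 4 rows (all one class), right has 6 rows (3/3 mix)
left_gini = 1 - (4 / 4) ** 2                    # 0.0, pure node
right_gini = 1 - ((3 / 6) ** 2 + (3 / 6) ** 2)  # 0.5, maximally mixed
total = 4 / 10 * left_gini + 6 / 10 * right_gini
print(total)                                    # 0.3
```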
def test_hypothesis(df: pd.DataFrame, col: str, res_col: str) -> tuple[float, float]:
    """
    Tests every hypothesis for a numerical variable, returning the best one for this column
    :param df: the DataFrame to test the hypotheses on
    :param col: the column of the hypothesis
    :param res_col: the column of the result classes
    :return: a tuple containing the best hypothesis for this column (impurity, hypothesis)
    """
    values = df.sort_values(col)[col].unique()
    best: tuple = None  # (impurity, hypothesis)
    for lo, hi in zip(values[:-1], values[1:]):  # iterate over neighbouring pairs to average them
        hypothesis = (lo + hi) / 2.0
        h_impurity = test_num_hypothesis(df, col, hypothesis, res_col)
        if best is None or h_impurity < best[0]:
            best = (h_impurity, hypothesis)
    return best
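For context on the performance gap: each call to test_num_hypothesis re-filters and re-groups the whole DataFrame, so one split costs roughly O(n) pandas work per candidate, i.e. O(n²) per column. A common alternative (a sketch of my own, not taken from scikit-learn's source) sorts the column once and sweeps cumulative class counts, evaluating every candidate threshold in a single pass:

```python
import numpy as np

def best_split(x, y):
    """Find the threshold t on x minimising the weighted Gini impurity of (x < t, x >= t)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    classes, y_idx = np.unique(y, return_inverse=True)
    n = len(x)
    # one-hot class matrix, then prefix sums: left[i] = class counts of x[:i+1]
    onehot = np.zeros((n, len(classes)))
    onehot[np.arange(n), y_idx] = 1
    left = np.cumsum(onehot, axis=0)
    total = left[-1]
    best = (np.inf, None)  # stays (inf, None) if all values of x are equal
    for i in range(n - 1):
        if x[i] == x[i + 1]:
            continue  # no threshold between equal values
        nl, nr = i + 1, n - i - 1
        pl = left[i] / nl
        pr = (total - left[i]) / nr
        gini = nl / n * (1 - (pl ** 2).sum()) + nr / n * (1 - (pr ** 2).sum())
        if gini < best[0]:
            best = (gini, (x[i] + x[i + 1]) / 2.0)
    return best
```

This brings one column's split search down to a sort plus a single O(n·k) sweep (k = number of classes). On top of that, scikit-learn implements the sweep in compiled Cython and only considers a random subset of features per node (max_features), which is why it finishes in milliseconds.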