Python: random forest regression with discrete (categorical) features?
I am using a random forest regressor because my target values are not categorical. However, the features are.
When I run the algorithm it treats them as continuous variables.
Is there any way to treat them as categorical?
For example, when I try the random forest regressor it treats user ID as continuous (taking values such as 1.5).
The dtype in the data frame is int64.
Could you help me with that?
Thanks.
Here is the code I have tried:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Load the data and inspect it
df = pd.read_excel('Data_frame.xlsx', sheet_name=5)
print(df.head())
print(df.dtypes)

# Split into features and target
X = df.drop('productivity', axis='columns')
y = df['productivity']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the random forest
rf = RandomForestRegressor(bootstrap=False, n_estimators=1000, criterion='squared_error',
                           max_depth=5, max_features='sqrt')
rf.fit(X_train.values, y_train)

# Plot one of the fitted trees
plt.figure(figsize=(15, 20))
_ = tree.plot_tree(rf.estimators_[1], feature_names=X.columns, filled=True, fontsize=8)

# Evaluate on the test set
y_predict = rf.predict(X_test.values)
mae = mean_absolute_error(y_test, y_predict)
print(mae)
Comments (1)
First of all, RandomForestRegressor only accepts numerical values, so encoding your numerical values as categorical is not a solution because you will not be able to train your model. The way to deal with this type of problem is OneHotEncoder. This transformer creates one column for every distinct value in the specified feature. Below is an example of code:
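A minimal sketch, assuming a small hypothetical DataFrame with an integer-coded user_id feature and a numerical productivity target:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data standing in for your own file
df = pd.DataFrame({'user_id': [1, 2, 3, 1, 2],
                   'productivity': [0.8, 0.5, 0.9, 0.7, 0.6]})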
The DataFrame will look like this:
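With the hypothetical values above:

print(df)
#    user_id  productivity
# 0        1           0.8
# 1        2           0.5
# 2        3           0.9
# 3        1           0.7
# 4        2           0.6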
Now, one-hot encode it:
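Sketch continued; get_feature_names_out and sparse_output assume a reasonably recent scikit-learn (older versions use sparse=False instead):

# One-hot encode user_id: each distinct value becomes its own 0/1 column
encoder = OneHotEncoder(sparse_output=False)  # sparse=False on scikit-learn < 1.2
encoded = encoder.fit_transform(df[['user_id']])
encoded_df = pd.DataFrame(encoded,
                          columns=encoder.get_feature_names_out(['user_id']),
                          index=df.index)
print(encoded_df)
#    user_id_1  user_id_2  user_id_3
# 0        1.0        0.0        0.0
# 1        0.0        1.0        0.0
# 2        0.0        0.0        1.0
# 3        1.0        0.0        0.0
# 4        0.0        1.0        0.0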
Then, depending on your needs, you can join this calculated frame to your dataset. Be aware that you should remove the initial feature:
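For instance, still on the hypothetical frame:

# Join the encoded columns back and drop the original user_id feature
df_final = df.join(encoded_df).drop(columns=['user_id'])
print(df_final)
#    productivity  user_id_1  user_id_2  user_id_3
# 0           0.8        1.0        0.0        0.0
# 1           0.5        0.0        1.0        0.0
# ...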
Of course, if you have hundreds of different values in your specified feature, many columns will be created. But this is the way to proceed.