Python: random forest regression with discrete (categorical) features?
I am using a random forest regressor because my target values are not categorical. However, the features are.
When I run the algorithm it treats them as continuous variables.
Is there any way to treat them as categorical?
For example, when I try the random forest regressor it treats user ID as continuous (taking values such as 1.5).
The dtype in the data frame is int64.
Could you help me with that?
Thanks.
Here is the code I have tried:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Load the data and inspect it
df = pd.read_excel('Data_frame.xlsx', sheet_name=5)
print(df.head())
print(df.dtypes)

# Split into features and target
X = df.drop('productivity', axis='columns')
y = df['productivity']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the random forest
rf = RandomForestRegressor(bootstrap=False, n_estimators=1000, criterion='squared_error',
                           max_depth=5, max_features='sqrt')
rf.fit(X_train.values, y_train)

# Plot one of the fitted trees
plt.figure(figsize=(15, 20))
_ = tree.plot_tree(rf.estimators_[1], feature_names=X.columns, filled=True, fontsize=8)

# Evaluate on the test set
y_predict = rf.predict(X_test.values)
mae = mean_absolute_error(y_test, y_predict)
print(mae)
Comments (1)
First of all, RandomForestRegressor only accepts numerical values, so encoding your numerical values as categorical is not a solution because you will not be able to train your model. The way to deal with this type of problem is OneHotEncoder. This transformer creates one column for every distinct value in the specified feature. Below is an example of code:
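A minimal sketch, assuming a small hypothetical DataFrame with an integer-coded user_id feature and a numerical productivity target:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data standing in for your own file
df = pd.DataFrame({'user_id': [1, 2, 3, 1, 2],
                   'productivity': [0.8, 0.5, 0.9, 0.7, 0.6]})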
The DataFrame will look like this:
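With the hypothetical values above:

print(df)
#    user_id  productivity
# 0        1           0.8
# 1        2           0.5
# 2        3           0.9
# 3        1           0.7
# 4        2           0.6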
Now, one-hot encode it:
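Sketch continued; get_feature_names_out and sparse_output assume a reasonably recent scikit-learn (older versions use sparse=False instead):

# One-hot encode user_id: each distinct value becomes its own 0/1 column
encoder = OneHotEncoder(sparse_output=False)  # sparse=False on scikit-learn < 1.2
encoded = encoder.fit_transform(df[['user_id']])
encoded_df = pd.DataFrame(encoded,
                          columns=encoder.get_feature_names_out(['user_id']),
                          index=df.index)
print(encoded_df)
#    user_id_1  user_id_2  user_id_3
# 0        1.0        0.0        0.0
# 1        0.0        1.0        0.0
# 2        0.0        0.0        1.0
# 3        1.0        0.0        0.0
# 4        0.0        1.0        0.0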
Then, depending on your needs, you can join this calculated frame to your dataset. Be aware that you should remove the initial feature:
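For instance, still on the hypothetical frame:

# Join the encoded columns back and drop the original user_id feature
df_final = df.join(encoded_df).drop(columns=['user_id'])
print(df_final)
#    productivity  user_id_1  user_id_2  user_id_3
# 0           0.8        1.0        0.0        0.0
# 1           0.5        0.0        1.0        0.0
# ...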
Of course, if you have hundreds of different values in your specified feature, many columns will be created. But this is the way to proceed.