How can I make my regression model more accurate with a Random Forest Regressor?
Issue: I am getting an R2 of about 0.64 and want to improve my results further. I don't know what the issue with these results is. I have already removed outliers, converted string features to numerical values, and normalized the data. Is there any issue with my output? Please ask me anything if I didn't ask the question correctly; this is just my start on Stack Overflow.
y.value_counts()
3.3 215
3.0 185
2.7 154
3.7 134
2.3 96
4.0 54
2.0 31
1.7 21
1.3 20
This is the distribution of my output (target) values. I am not a professional in regression, so I would really appreciate your help.
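For reference, a minimal sketch of the preprocessing mentioned above (string-to-numeric conversion, outlier removal, normalization); the z-score cutoff of 3 and the choice of LabelEncoder and MinMaxScaler are assumptions, not necessarily what was actually used here:

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Encode string columns as integers (assumed: one LabelEncoder per column)
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Drop rows whose z-score exceeds 3 in any column (assumed threshold)
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

# Scale every column to the [0, 1] range (assumed: min-max normalization)
df = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns, index=df.index)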
Removing Collinearity in my inputs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

target = 'Please enter your Subjects GPA which you have studied? (CS) [Introduction to ICT]'

# data = z_scores(df)
data = df
correlation = data.corr()

# Keep the k columns most strongly correlated with the target and plot their correlation matrix
k = 22
cols = correlation.nlargest(k, target)[target].index
cm = np.corrcoef(data[cols].values.T)
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(cm, vmax=.8, linewidths=0.01, square=True, annot=True, cmap='viridis',
            linecolor="white", xticklabels=cols.values, annot_kws={'size': 12},
            yticklabels=cols.values)

# Drop the target itself and one redundant feature, then build the feature matrix
cols = pd.DataFrame(cols)
cols = cols.set_axis(["Selected Features"], axis=1)
cols = cols[cols['Selected Features'] != target]
cols = cols[cols['Selected Features'] != 'Your Fsc/Ics marks percentage?']
X = df[cols['Selected Features'].tolist()]
X
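As a side note, a common complementary way to reduce collinearity (different from the target-correlation ranking above) is to drop one feature from every pair whose absolute pairwise correlation exceeds a threshold; the 0.9 cutoff below is just an assumed value:

import numpy as np

# Absolute pairwise correlations between the selected features
corr_matrix = X.corr().abs()

# Keep only the upper triangle so every pair is inspected once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Drop one feature from each pair correlated above 0.9 (assumed threshold)
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)
print("Dropped due to collinearity:", to_drop)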
Then I applied a Random Forest Regressor:
import math
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hold out 10% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Fit a forest of 10 trees and evaluate on the held-out set
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
model = regressor.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MAE Score: ", mean_absolute_error(y_test, y_pred))
print("MSE Score: ", mean_squared_error(y_test, y_pred))
print("RMSE Score: ", math.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 score : %.2f" % r2_score(y_test, y_pred))
I got these results:
MAE Score: 0.252967032967033
MSE Score: 0.13469450549450546
RMSE Score: 0.36700750059706605
R2 score : 0.64
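One caveat worth checking before tuning anything: with test_size=0.1 the test set is quite small, so a single-split R2 can vary a lot from split to split. A minimal sketch of scoring the same model with cross-validation instead (5 folds is an assumption):

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Score the same model on 5 different train/validation splits
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
scores = cross_val_score(regressor, X, y, cv=5, scoring='r2')
print("R2 per fold:", scores)
print("Mean R2: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))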
Comments (1)
In order to get better results, you need to do hyperparameter tuning. Try to focus on these (a sketch follows below):
k-fold cross-validation
grid search (GridSearchCV)
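A minimal sketch of that suggestion with scikit-learn, combining k-fold cross-validation and GridSearchCV; the parameter grid values and the 5-fold setting are assumptions to adapt to your data:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Hypothetical grid; widen or narrow it depending on how long the search may run
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['sqrt', 1.0],
}

# Exhaustive search over the grid, each candidate scored by 5-fold cross-validated R2
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid, cv=5, scoring='r2', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated R2: %.2f" % grid.best_score_)
print("Test R2: %.2f" % grid.score(X_test, y_test))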