Predicting one column from the other columns with XGBoost in Python
I have a large dataframe, and I want to predict the last column from the other columns with XGBoost. My code is written below, but my predictions are wrong: I get a constant value. The data is not a time series, and my trees also can't be plotted.
Overall, is it possible, with 20 columns, to predict the 20th column using the other 19 columns with this method?
#XGBoost
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

#Separate the target variable (f is the dataframe; the last column is the target)
X, y = f.iloc[:, :-1], f.iloc[:, -1]
data_dmatrix = xgb.DMatrix(data=X, label=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)

#Regressor ('reg:linear' is a deprecated alias of 'reg:squarederror')
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
                          learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10)

#Fit the regressor to the training set and make predictions on the test set
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

#RMSE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % rmse)

#k-fold Cross Validation
params = {"objective": "reg:squarederror", 'colsample_bytree': 0.3,
          'learning_rate': 0.1, 'max_depth': 10, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10,
                    metrics="rmse", as_pandas=True, seed=123)
print(cv_results["test-rmse-mean"].tail(1))

#Visualizing
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)

#Plot one of the trees (xgb.plot_tree needs the graphviz package installed;
#set the figure size before plotting so that it actually takes effect)
plt.rcParams['figure.figsize'] = [50, 10]
xgb.plot_tree(xg_reg, num_trees=5)
plt.show()

#Examine the importance of each feature column in the original dataset within the model
plt.rcParams['figure.figsize'] = [5, 5]
xgb.plot_importance(xg_reg)
plt.show()
Comments (1)
First of all, yes, the approach of predicting the last column from the first 19 columns is fine.
If the model only produces constant values, I would change the model's parameters, for example as sketched below.
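A minimal sketch of such a change (my assumption about the cause, not stated in the original answer: with alpha = 10 and only 10 boosting rounds at learning_rate = 0.1, the L1 penalty can shrink almost every leaf to zero, leaving a near-constant prediction):

#Hypothetical, less heavily regularized configuration; the values are
#illustrative starting points, not tuned for any particular dataset
xg_reg = xgb.XGBRegressor(objective='reg:squarederror',
                          colsample_bytree=0.8,  #sample more features per tree
                          learning_rate=0.1,
                          max_depth=5,
                          alpha=0,               #drop the strong L1 penalty
                          n_estimators=300)      #give boosting enough rounds
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
print("RMSE: %f" % np.sqrt(mean_squared_error(y_test, preds)))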
Or first train a linear model as a baseline, for instance:
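A minimal baseline sketch, assuming the same X_train/X_test split as in the question and scikit-learn's LinearRegression:

from sklearn.linear_model import LinearRegression

#Ordinary least-squares baseline on the same split
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_preds = lin_reg.predict(X_test)
print("Baseline RMSE: %f" % np.sqrt(mean_squared_error(y_test, lin_preds)))

If this baseline already produces non-constant predictions with a reasonable RMSE, the data itself is usable and the problem most likely lies in the booster's configuration.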