Predicting one column from the other columns with XGBoost in Python

Posted 2025-02-12 08:56:19

I have a large dataframe, and I want to predict the last column from the other columns with XGBoost. My code is below, but the predictions are wrong: I get a constant value.
The data is not a time series. My trees also can't be plotted.

Overall, with 20 columns, is it possible to predict the 20th column from the other 19 columns with this method?

#XGBoost

    import numpy as np
    import xgboost as xgb
    from sklearn.metrics import mean_squared_error

#Separate the target variable

    X, y = f.iloc[:,:-1],f.iloc[:,-1]

    data_dmatrix = xgb.DMatrix(data=X,label=y)

 

    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123) 


#Regressor

    xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                    max_depth = 5, alpha = 10, n_estimators = 10)


#Fit the regressor to the training set and make predictions on the test set

    xg_reg.fit(X_train,y_train)
    
    preds = xg_reg.predict(X_test)


#RMSE

    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print("RMSE: %f" % (rmse))


#k-fold Cross Validation

    params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
                    'max_depth': 10, 'alpha': 10}
    
    cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                        num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
    
    print((cv_results["test-rmse-mean"]).tail(1))

#Visualizing

    xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)

#plot the trees

    import matplotlib.pyplot as plt
    
    # set the figure size before plotting; plot_tree also needs the graphviz package
    plt.rcParams['figure.figsize'] = [50, 10]
    xgb.plot_tree(xg_reg,num_trees=5)
    plt.show()

#Examine the importance of each feature column in the original dataset within the model

    plt.rcParams['figure.figsize'] = [5, 5]

    xgb.plot_importance(xg_reg)

    plt.show()


Comments (1)

提赋 2025-02-19 08:56:19

First of all, yes, predicting the last column from the first 19 columns is a valid approach.

If the model only produces constant values, I would change the parameters of the model.

Or train a linear model as a baseline first.
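A baseline along these lines could be sketched as follows. The data is synthetic and purely an assumption (19 features, a target that is roughly linear in two of them), and the helper `rmse_of` is hypothetical. The idea is to compare a plain linear regression against a model that always predicts the training mean, so there is a reference RMSE that any XGBoost model should beat.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataframe (assumption): 19 features and a
# target that is roughly linear in the first two.
rng = np.random.default_rng(123)
X = rng.normal(size=(1000, 19))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=123
)


def rmse_of(model):
    """Hypothetical helper: fit the model and return its RMSE on the test split."""
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


rmse_mean = rmse_of(DummyRegressor(strategy="mean"))  # constant prediction
rmse_linear = rmse_of(LinearRegression())             # linear baseline

print("constant (mean) RMSE: %f" % rmse_mean)
print("linear model RMSE:    %f" % rmse_linear)
```

If the XGBoost RMSE is close to the constant-prediction RMSE, the model has learned almost nothing, which points at the parameters rather than at the overall approach.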
