Inconsistent numbers of samples when splitting test/train data with xgboost


I am new to machine learning, so be gentle. I have a single CSV file of data that I would like to split into test/train sets. I have used the following code to split the data:

raw_data1.drop('Income_Weekly', axis=1, inplace=True)

df = raw_data1

df['split'] = np.random.randn(df.shape[0], 1)

msk = np.random.rand(len(df)) <= 0.5

X_train = df[msk]
y_train = df[~msk]

However, when trying to apply the xgboost algorithm, I receive an error:

ValueError: Found input variables with inconsistent numbers of samples: [4791, 5006]

The error occurs at the line:

random_cv.fit(X_train,y_train)

My complete code is as follows:

import xgboost
from sklearn.model_selection import RandomizedSearchCV

classifier=xgboost.XGBRegressor()
regressor=xgboost.XGBRegressor() 

booster=['gbtree','gblinear']
base_score=[0.25,0.5,0.75,1]

## Hyper Parameter Optimization
#

n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster=['gbtree','gblinear']
learning_rate=[0.05,0.1,0.15,0.20]
min_child_weight=[1,2,3,4]

# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth':max_depth,
    'learning_rate':learning_rate,
    'min_child_weight':min_child_weight,
    'booster':booster,
    'base_score':base_score
    }

random_cv = RandomizedSearchCV(estimator=regressor,
            param_distributions=hyperparameter_grid,
            cv=5, n_iter=50,
            scoring = 'neg_mean_absolute_error',n_jobs = 4,
            verbose = 5, 
            return_train_score = True,
            random_state=42)

random_cv.fit(X_train,y_train)

Comments (1)

神爱温柔 2025-01-26 09:22:59


Your current mask and splitting method produces two complementary subsets of the dataframe with different numbers of rows, so the X_train and y_train you pass to fit() do not have the same number of samples; what ends up in y_train is just the other half of the feature rows, not a target column. XGBoost (via scikit-learn) needs X and y to be the same length, with one target value per feature row.
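
If you prefer to keep the mask approach, the fix is to build X and y first and then index both with the same mask. Here is a minimal sketch (with a small toy dataframe standing in for raw_data1, and assuming 'Income_Weekly' is the target column):

import numpy as np
import pandas as pd

# Toy dataframe standing in for raw_data1; 'Income_Weekly' is assumed to be the target.
raw_data1 = pd.DataFrame({
    'xvar1': np.arange(10, dtype=float),
    'xvar2': np.arange(10, 20, dtype=float),
    'Income_Weekly': np.arange(100, 110, dtype=float),
})

X = raw_data1.drop('Income_Weekly', axis=1)   # features only
y = raw_data1['Income_Weekly']                # target, in the same row order as X

msk = np.random.rand(len(raw_data1)) <= 0.8   # one mask, reused for both X and y
X_train, X_test = X[msk], X[~msk]
y_train, y_test = y[msk], y[~msk]

# Because the same mask indexes both X and y, the row counts always match.
assert len(X_train) == len(y_train)
assert len(X_test) == len(y_test)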

A simpler option is to change the splitting code to use sklearn's train_test_split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[list_x_cols], df['y_col'], test_size=0.2)

It looks like you are using a random threshold to split the set; you can vary test_size if you want, but fixed fractions such as 0.2, 0.3, or 0.33 are the usual choices.

list_x_cols should be a list of all your feature column names, e.g. list_x_cols = ['xvar1', 'xvar2', 'xvar3', ...], so that df[list_x_cols] selects just those columns.
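
One way to build that list is to derive it from the dataframe itself. A small sketch, assuming the target is the 'Income_Weekly' column from the question (i.e. it has not been dropped from df) and excluding the leftover 'split' helper column:

# Everything except the target (and the helper 'split' column) is treated as a feature.
# 'Income_Weekly' and 'split' are the column names from the question; adjust them to your data.
list_x_cols = [c for c in df.columns if c not in ('Income_Weekly', 'split')]

X_train, X_test, y_train, y_test = train_test_split(
    df[list_x_cols], df['Income_Weekly'], test_size=0.2, random_state=42)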

To test whether the two sides of your split are misaligned, the following code can be used. Please share the outcome of the print statements:

print(X_train.shape)
print(y_train.shape)

if X_train.shape[0] != y_train.shape[0]:
  print("X and y rows imbalanced")