Inconsistent numbers of samples when splitting test/train data with xgboost
I am new to machine learning, so be gentle. I have a single CSV file of data that I would like to split into train/test sets. I have used the following code to split the data:
import numpy as np

raw_data1.drop('Income_Weekly', axis=1, inplace=True)
df = raw_data1
df['split'] = np.random.randn(df.shape[0], 1)
msk = np.random.rand(len(df)) <= 0.5
X_train = df[msk]
y_train = df[~msk]
However, when trying to apply the xgboost algorithm, I receive an error:
ValueError: Found input variables with inconsistent numbers of samples: [4791, 5006]
The error occurs at the line:
random_cv.fit(X_train,y_train)
My complete code is as follows:
import xgboost
from sklearn.model_selection import RandomizedSearchCV

regressor = xgboost.XGBRegressor()

## Hyper Parameter Optimization
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster = ['gbtree', 'gblinear']
learning_rate = [0.05, 0.1, 0.15, 0.20]
min_child_weight = [1, 2, 3, 4]
base_score = [0.25, 0.5, 0.75, 1]

# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth': max_depth,
    'learning_rate': learning_rate,
    'min_child_weight': min_child_weight,
    'booster': booster,
    'base_score': base_score
}

random_cv = RandomizedSearchCV(estimator=regressor,
                               param_distributions=hyperparameter_grid,
                               cv=5, n_iter=50,
                               scoring='neg_mean_absolute_error', n_jobs=4,
                               verbose=5,
                               return_train_score=True,
                               random_state=42)
random_cv.fit(X_train, y_train)
1 Answer
Your current mask-and-split method produces two DataFrames of different sizes: df[msk] and df[~msk] select complementary subsets of the rows, so X_train and y_train almost never end up with the same number of samples (here 4791 vs. 5006). fit requires X and y to have the same number of rows, and y should be the target column rather than a second slice of the feature rows.
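For intuition, here is a minimal, self-contained sketch (with synthetic data, not your CSV) of why the two masks give differently sized frames:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(10)})
msk = np.random.rand(len(df)) <= 0.5  # True for roughly half the rows, at random
print(len(df[msk]), len(df[~msk]))    # the two counts only match by coincidence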
Maybe change the splitting code to use sklearn's train_test_split.
It looks like you are using a random number to split the set; you can pass a random value as test_size if you want, but 0.2, 0.3, or 0.33 are commonly used.
list_x_cols should contain all of your feature column names, e.g. ['xvar1', 'xvar2', 'xvar3', ...].
To test whether the shapes of your split are misaligned, the following code can be used. Please share the output of the print statements:
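A minimal sketch of such a split, assuming (as in your question) that the target column is 'Income_Weekly'; the filename here is a placeholder:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('your_data.csv')  # placeholder filename

# All feature columns, i.e. everything except the target
list_x_cols = [c for c in df.columns if c != 'Income_Weekly']
X = df[list_x_cols]
y = df['Income_Weekly']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

With this split, X_train and y_train always have the same number of rows, so random_cv.fit(X_train, y_train) should no longer raise the ValueError.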