Kaggle competition: categorical variables
In the categorical variables exercise, the last part asks you to generate test predictions. I wrote the following code, but I am getting an error. I don't understand the error, or why it says X has 148 features while the random forest is expecting 155.
My code:
ohencoder=OneHotEncoder(handle_unknown='ignore', sparse=False)
# X_test.dropna(axis=0, inplace=True)
h_cols_test = pd.DataFrame(ohencoder.fit_transform(X_test[low_cardinality_cols])) # Your code here
h_cols_test.index=X_test.index
num_X_test= X_test.drop(object_cols, axis=1)
OH_X_test=pd.concat([num_X_test, h_cols_test], axis=1)
#randomforest mode-----------------------------
model=RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(OH_X_train, y_train)
preds_test= model.predict(OH_X_test)
#output---------------
output=pd.DataFrame({'Id': X_test.index,
'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
Error message:
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
FutureWarning,
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_33/1524045498.py in <module>
12 model.fit(OH_X_train, y_train)
13
---> 14 preds_test= model.predict(OH_X_test)
15
16 output=pd.DataFrame({'Id': X_test.index,
/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
969 check_is_fitted(self)
970 # Check data
--> 971 X = self._validate_X_predict(X)
972
973 # Assign chunk of trees to jobs
/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
577 Validate X whenever one tries to predict, apply, predict_proba."""
578 check_is_fitted(self)
--> 579 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
580 if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
581 raise ValueError("No support for np.int64 index based sparse matrices")
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
583
584 if not no_val_X and check_params.get("ensure_2d", True):
--> 585 self._check_n_features(X, reset=reset)
586
587 return out
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _check_n_features(self, X, reset)
399 if n_features != self.n_features_in_:
400 raise ValueError(
--> 401 f"X has {n_features} features, but {self.__class__.__name__} "
402 f"is expecting {self.n_features_in_} features as input."
403 )
ValueError: X has 148 features, but RandomForestRegressor is expecting 155 features as input.
Comments (1)
Your training and test sets have a different number of features. Either the model was trained on features that are missing from the test set, or the test set contains features the model never saw during training.
A likely cause is one-hot encoding performed separately on each data set: some values of a categorical variable may be present in only one of the sets, so each encoder produces a different set of columns. In your code, ohencoder.fit_transform(X_test[low_cardinality_cols]) refits the encoder on the test data instead of reusing the encoder fitted on the training data.
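To see how separate encoding leads to a column mismatch, here is a minimal sketch on made-up data. The Street column and its values are hypothetical stand-ins, not the exercise's actual data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set has no "Pave" rows.
train = pd.DataFrame({"Street": ["Grvl", "Pave", "Dirt"]})
test = pd.DataFrame({"Street": ["Grvl", "Dirt"]})

# Fitting a separate encoder on each set gives one column per category
# seen in that set, so the widths differ.
n_train_cols = OneHotEncoder(handle_unknown="ignore").fit_transform(train).shape[1]
n_test_cols = OneHotEncoder(handle_unknown="ignore").fit_transform(test).shape[1]

print(n_train_cols, n_test_cols)  # 3 2
```

A model fitted on the 3-column training matrix would reject the 2-column test matrix with the same kind of ValueError as in the question.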
One solution is to perform the OHE before splitting the data. Alternatively, use fit_transform on your training set and then only transform on your test set. You should always use transform when processing new data; this is a general rule for all scikit-learn transformers. Of course, you should also make sure that every other transformation, such as dropping columns, is applied the same way to both the training and test sets. Pipelines are your best friend here.