Kaggle competition: Categorical Variables


The last part of the Categorical Variables exercise asks you to generate test predictions. I have written the following code but I'm getting an error. I can't understand the error, or why it says X has 148 features when the random forest is expecting 155 features.

My code:

ohencoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

# X_test.dropna(axis=0, inplace=True)
h_cols_test = pd.DataFrame(ohencoder.fit_transform(X_test[low_cardinality_cols]))  # Your code here

h_cols_test.index = X_test.index

num_X_test = X_test.drop(object_cols, axis=1)

OH_X_test = pd.concat([num_X_test, h_cols_test], axis=1)

# random forest model -----------------------------
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(OH_X_train, y_train)

preds_test = model.predict(OH_X_test)

# output ---------------
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

Error message:

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
  FutureWarning,
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_33/1524045498.py in <module>
     12 model.fit(OH_X_train, y_train)
     13 
---> 14 preds_test= model.predict(OH_X_test)
     15 
     16 output=pd.DataFrame({'Id': X_test.index,

/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
    969         check_is_fitted(self)
    970         # Check data
--> 971         X = self._validate_X_predict(X)
    972 
    973         # Assign chunk of trees to jobs

/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
    577         Validate X whenever one tries to predict, apply, predict_proba."""
    578         check_is_fitted(self)
--> 579         X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
    580         if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
    581             raise ValueError("No support for np.int64 index based sparse matrices")

/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    583 
    584         if not no_val_X and check_params.get("ensure_2d", True):
--> 585             self._check_n_features(X, reset=reset)
    586 
    587         return out

/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _check_n_features(self, X, reset)
    399         if n_features != self.n_features_in_:
    400             raise ValueError(
--> 401                 f"X has {n_features} features, but {self.__class__.__name__} "
    402                 f"is expecting {self.n_features_in_} features as input."
    403             )

ValueError: X has 148 features, but RandomForestRegressor is expecting 155 features as input.


说好的呢 2025-02-19 17:57:58


You have a different number of features in your training and test sets. So there may be features present in training that the model cannot find in the test set, or features in the test set on which the model was never trained.

A likely cause of this error is performing the one-hot encoding separately on each data set: some categorical variables may have values that appear in only one of the two sets, so the two encoders produce different numbers of columns.
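For example, here is a minimal sketch with toy data (not the competition set) showing how fitting a separate encoder on each set produces different column counts:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
test = pd.DataFrame({'Street': ['Pave', 'Pave']})   # 'Grvl' never appears here

# sparse=False as in your code; newer scikit-learn (>= 1.2) uses sparse_output=False
print(OneHotEncoder(sparse=False).fit_transform(train).shape)  # (3, 2): two categories seen
print(OneHotEncoder(sparse=False).fit_transform(test).shape)   # (2, 1): only one category seen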

One solution is to perform the OHE before splitting the data. Alternatively, call fit_transform on your training set and then only transform on your test set. Remember that you should always use transform when processing new data; this is a general rule for all scikit-learn transformers.
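A rough sketch of that second approach, assuming the same X_train, X_test, low_cardinality_cols and object_cols from the exercise (X_train being the training counterpart of your X_test):

from sklearn.preprocessing import OneHotEncoder

ohencoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

# Fit AND transform on the training set only
OH_cols_train = pd.DataFrame(ohencoder.fit_transform(X_train[low_cardinality_cols]),
                             index=X_train.index)

# Transform ONLY on the test set; categories unseen during fitting are encoded as
# all zeros thanks to handle_unknown='ignore', so the column count always matches
OH_cols_test = pd.DataFrame(ohencoder.transform(X_test[low_cardinality_cols]),
                            index=X_test.index)

num_X_train = X_train.drop(object_cols, axis=1)
num_X_test = X_test.drop(object_cols, axis=1)

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)

# Optional: make all column names strings to silence the FutureWarning you are seeing
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_test.columns = OH_X_test.columns.astype(str)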

Of course, you should also make sure that all other transformations, such as dropping columns, are performed identically on the training and test sets. Pipelines are your best friend here.
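For reference, a sketch of the pipeline version under the same assumptions about the exercise variables, and assuming (as in your code) that the categorical columns outside low_cardinality_cols are dropped. The ColumnTransformer one-hot encodes the low-cardinality columns and passes the rest through, so fitting the whole pipeline on the training data guarantees the test data is transformed identically:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Drop the high-cardinality categorical columns from both sets
cols_to_drop = [c for c in object_cols if c not in low_cardinality_cols]

preprocessor = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), low_cardinality_cols)],
    remainder='passthrough')   # numeric columns pass through untouched

pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
])

pipeline.fit(X_train.drop(cols_to_drop, axis=1), y_train)
preds_test = pipeline.predict(X_test.drop(cols_to_drop, axis=1))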
