Getting different predictions from XGBoost when converting a sparse matrix to an array?

Posted 2025-02-10 22:44:09

I have a model trying to predict the category of a vendor.

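import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

# X_train (text) and y_train (labels) are my training data, not shown here
count_vect = CountVectorizer(ngram_range=(1, 2), min_df=5, binary=True)
X_train_counts = count_vect.fit_transform(X_train)  # scipy sparse matrix
clf = XGBClassifier().fit(X_train_counts, y_train)

my_data = pd.Series(['ebay', 'amazon', 'costco', 'uber'])

sparse_count = count_vect.transform(my_data)
print(clf.predict(sparse_count))            # sparse input
print(clf.predict(sparse_count.toarray()))  # same rows as a dense array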

This gives the output:

[19 16  1 18]
[20 20 20 20]

So passing the data as an array clearly changes the predictions. The sparse predictions look like the correct ones, but I want to understand what causes the difference. It seems to be related to training on a sparse matrix and then predicting on an array.

If I first convert the sparse matrix to an array and train on that (which takes much longer), I get the same predictions whether I pass sparse or array input. But when I train on sparse data, I get weird predictions when I predict on an array. Why is that?
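
One possible explanation, sketched under assumptions: the XGBoost FAQ notes that the tree booster treats entries a sparse matrix does not store as missing, while the zeros in a dense array are ordinary values. If that is what happens here, a model trained on sparse data would learn a default branch for missing features, and the explicit zeros from toarray() would be compared against the split threshold instead. The toy below is not XGBoost code; the function route, the threshold, and the default direction are all hypothetical and only illustrate how the same row could take different branches.

```python
# Toy model of a single tree split with a learned default direction for
# missing values. NOT XGBoost's implementation; every name is hypothetical.

def route(value, threshold, default_left):
    """Return 'left' or 'right' for one feature value at one split."""
    if value is None:  # missing: follow the learned default branch
        return "left" if default_left else "right"
    return "left" if value < threshold else "right"

# One row with three binary features; only feature 1 is non-zero.
sparse_row = {1: 1.0}        # unstored entries are read as missing
dense_row = [0.0, 1.0, 0.0]  # a dense array holds explicit zeros

# Assumed split on feature 0: threshold 0.5, missing values go right.
threshold, default_left = 0.5, False

sparse_branch = route(sparse_row.get(0), threshold, default_left)
dense_branch = route(dense_row[0], threshold, default_left)

print(sparse_branch)  # right
print(dense_branch)   # left
```

If this is the cause, keeping the input type consistent between fit and predict (sparse for both, or dense for both) should avoid the mismatch, which matches what I see when I train on the dense array.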
