RandomForestRegressor: Found input variables with inconsistent numbers of samples

Posted on 2025-02-13 04:30:45

This is for a project that's due soon, so help would be greatly appreciated. I've never done ML before, so sorry if the mistake is an absolute smooth-brain one.

I have a dataset of tweets along with personality scores, and I need to train a model to predict the scores.
This is what I've done so far by following a bunch of tutorials and stitching together what I learned.

train = pandas.read_csv('../dataset/cleaner_dataset.csv')
train['tweet'] = train['tweet'].str.lower()
train['tweet'] = train['tweet'].replace('[^a-zA-Z0-9]', ' ', regex = True)

X = train['tweet']
y = train['neuroticism']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)

vectorizer = TfidfVectorizer(min_df=5)
X_test_vec = vectorizer.fit_transform(X_train) 

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_vectorized, y_train)

model.score(X_test_vec, y_test)

However, I'm getting an error on the last line of code when I run it in the notebook.

ValueError: Found input variables with inconsistent numbers of samples: [495, 1980]

Full error message: https://i.sstatic.net/cff5w.jpg

胡大本事 2025-02-20 04:30:45

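You are vectorizing X_train for both the training and the test features, which is why you are getting the error: X_test_vec ends up with 1980 rows (the size of X_train), while y_test has only 495.

Try: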

vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)

X_test_vec = vectorizer.transform(X_test) # use the same vectorizer, do not define a new one

As pointed out below, we don't fit the vectorizer on the test set.
But you still need to transform X_test and score it against y_test.
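
For reference, here is a minimal end-to-end sketch of the corrected flow. The tweets and scores are made-up stand-in data (not the asker's CSV), and `min_df=2`, `n_estimators=20`, and `random_state=0` are assumptions chosen only to keep the toy example small and repeatable. The point is that the vectorizer is fitted once on the training text and then reused to transform the test text, so the row counts line up with `y_train` and `y_test`.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 short "tweets" with made-up scores.
words = ["happy", "sad", "angry", "calm", "tired"]
tweets = [f"{words[i % 5]} {words[(i + 1) % 5]} day" for i in range(50)]
scores = [float(i % 5) for i in range(50)]

X_train, X_test, y_train, y_test = train_test_split(
    tweets, scores, test_size=0.2, random_state=0
)

# Fit the vocabulary and IDF weights on the training text only...
vectorizer = TfidfVectorizer(min_df=2)
X_train_vec = vectorizer.fit_transform(X_train)
# ...then reuse the same fitted vectorizer on the test text.
X_test_vec = vectorizer.transform(X_test)

# Sanity checks: rows match the targets, columns match between splits.
assert X_train_vec.shape[0] == len(y_train)
assert X_test_vec.shape[0] == len(y_test)
assert X_train_vec.shape[1] == X_test_vec.shape[1]

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_train_vec, y_train)

# score() returns R^2 on the held-out data.
r2 = model.score(X_test_vec, y_test)
print(X_train_vec.shape, X_test_vec.shape, r2)
```

If one of the shape assertions fails, the features and targets came from different splits, which is the same situation the ValueError in the question is reporting.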
