RandomForestRegressor: Found input variables with inconsistent numbers of samples

Posted on 2025-02-13 04:30:45

This is for a project that's due soon, so help would be greatly appreciated. I've never done ML before, so sorry if the mistake is an absolute smooth-brain one.

I have a dataset that's a bunch of tweets along with personality scores, and I need to train a model to predict the scores.
This is what I've done so far by following a bunch of tutorials and stitching together what I learned.

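import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

train = pandas.read_csv('../dataset/cleaner_dataset.csv')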
train['tweet'] = train['tweet'].str.lower()
train['tweet'] = train['tweet'].replace('[^a-zA-Z0-9]', ' ', regex = True)

X = train['tweet']
y = train['neuroticism']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)

vectorizer = TfidfVectorizer(min_df=5)
X_test_vec = vectorizer.fit_transform(X_train) 

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_vectorized, y_train)

model.score(X_test_vec, y_test)

However, I'm getting an error on the last line of code when I run it in the notebook.

ValueError: Found input variables with inconsistent numbers of samples: [495, 1980]

Full error message: https://i.sstatic.net/cff5w.jpg

Comments (1)

胡大本事 2025-02-20 04:30:45

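You are calling fit_transform on X_train for both the training and the test features, which is why you are getting the error: X_test_vec ends up with the 1980 training rows, while y_test has only 495.

Try: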

vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)

X_test_vec = vectorizer.transform(X_test) # use the same vectorizer, do not define a new one

As pointed out below, we don't fit the vectorizer on the test set.
But you still need to use X_test with y_test.
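
For reference, here is a minimal end-to-end sketch of the same fix that can be run without the original CSV. The toy tweets, the random scores, and the n_estimators and random_state values are all made up for illustration and are not from the question; only the fit_transform / transform pattern matters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the 'tweet' and 'neuroticism' columns.
rng = np.random.default_rng(0)
words = ["happy", "sad", "angry", "calm", "tired", "excited"]
tweets = [" ".join(rng.choice(words, size=5)) for _ in range(100)]
scores = rng.uniform(1.0, 5.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    tweets, scores, test_size=0.2, random_state=0
)

vectorizer = TfidfVectorizer(min_df=5)
X_train_vec = vectorizer.fit_transform(X_train)  # learn the vocabulary on train only
X_test_vec = vectorizer.transform(X_test)        # reuse that vocabulary on the test set

# Row counts now match their targets, and both matrices share the same columns.
assert X_train_vec.shape[0] == len(y_train)
assert X_test_vec.shape[0] == len(y_test)
assert X_train_vec.shape[1] == X_test_vec.shape[1]

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_train_vec, y_train)
r2 = model.score(X_test_vec, y_test)  # R^2 on the held-out tweets
print(r2)
```

Since the scores here are random, the printed R^2 is meaningless; the point is only that score() runs once the features and targets come from the same split.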
