How to encode my data for logistic regression, RF and GBDT

Posted 2025-02-07 02:53:17


I want to perform logistic regression but I'm not sure how to encode the input. I already split the data.

My df has the following columns:

doc_no, personal_no, tokens (preprocessed texts using spacy), day and score

Data types:

 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   doc_no       30363 non-null  int64 
 1   personal_no  30363 non-null  int64 
 2   tokens       30363 non-null  object
 3   day          30363 non-null  object
 4   score        30363 non-null  object
dtypes: int64(2), object(3)

The tokens look like a list of lists.
Day is as follows: MAY03, JUN05 etc.
Score is 0 or 1.

I want to predict score based on tokens.

I split the data:

from sklearn.model_selection import train_test_split

columns = ['doc_no', 'personal_no', 'tokens', 'day', 'score']

df = df_new.loc[:, columns]

# arranging the data
features = ['doc_no', 'personal_no', 'tokens', 'day']

X = df.loc[:, features]
y = df.loc[:, 'score']  # select as a Series so estimators get a 1-D target

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0,
                                                    train_size=0.8)

I know I need to vectorize the tokens, with TF-IDF I think? But that hasn't worked so far. When I try to run this:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X_train, y_train)

I get this error:

ValueError: could not convert string to float: "['word', 'word', 'word', 'word']"

I changed the contents to 'word' because my data is privacy-sensitive.

What do I do?

I also want to train random forest and gradient-boosted decision tree classifiers on this data. Are there other things I need to take into account for those algorithms?

Thanks in advance!


Comments (1)

離殇 2025-02-14 02:53:17


TfidfVectorizer (I assume you mean the one from sklearn) works on raw texts, so if your text is already tokenized you can't feed it in directly.

You can apply the formula yourself, which is simple enough, or you can do the following:

from sklearn.feature_extraction.text import TfidfVectorizer

# join each token list back into one string, then vectorize
texts = [" ".join(ts) for ts in tokens]
tfidf_scores = TfidfVectorizer().fit_transform(texts)

NB: I assume that the tokens you get from spacy are different from those you'd get with sklearn. Otherwise, just skip the tokenization and work directly on your texts.
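Putting the pieces together, here is a minimal end-to-end sketch. The token lists and scores below are stand-ins for the asker's redacted data: join each token list into a string, TF-IDF vectorize (fitting only on the training split), then fit LogisticRegression. The same sparse matrix can also be fed to RandomForestClassifier, which answers the follow-up question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the privacy-redacted token lists and 0/1 scores.
token_lists = [["good", "fast", "service"], ["bad", "slow", "delay"],
               ["good", "helpful"], ["bad", "rude"],
               ["fast", "helpful"], ["slow", "rude"]]
scores = [1, 0, 1, 0, 1, 0]

# Join each token list back into one string so TfidfVectorizer accepts it.
texts = [" ".join(toks) for toks in token_lists]

# Stratify so both classes appear in each split of this tiny example.
X_train, X_test, y_train, y_test = train_test_split(
    texts, scores, random_state=0, train_size=0.5, stratify=scores)

# Fit the vectorizer on the training split only, then reuse its vocabulary
# on the test split to avoid leaking test-set statistics.
vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(X_train)
X_test_tfidf = vec.transform(X_test)

logreg = LogisticRegression().fit(X_train_tfidf, y_train)
rf = RandomForestClassifier(random_state=0).fit(X_train_tfidf, y_train)

print(logreg.predict(X_test_tfidf))
print(rf.predict(X_test_tfidf))
```

GradientBoostingClassifier should accept the same sparse matrix as well. Alternatively, if you want to keep the spaCy tokenization as-is, TfidfVectorizer can be pointed at pre-tokenized input with a callable analyzer (e.g. `analyzer=lambda toks: toks`), which skips the join step entirely.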
