How to encode my data for logistic regression, RF and GBDT
I want to perform logistic regression but I'm not sure how to encode the input. I already split the data.
My df has the following columns:
doc_no, personal_no, tokens (preprocessed texts using spacy), day and score
Data types:
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   doc_no       30363 non-null  int64
 1   personal_no  30363 non-null  int64
 2   tokens       30363 non-null  object
 3   day          30363 non-null  object
 4   score        30363 non-null  object
dtypes: int64(2), object(3)
The tokens look like a list of lists.
Day is as follows: MAY03, JUN05 etc.
Score is 0 or 1.
I want to predict score based on tokens.
I split the data:
from sklearn.model_selection import train_test_split

columns = ['doc_no', 'personal_no', 'tokens', 'day', 'score']
df = df_new.loc[:, columns]
# arranging the data
features = ['doc_no', 'personal_no', 'tokens', 'day']
X = df.loc[:, features]
y = df.loc[:, ['score']]
# train test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, train_size=0.8)
I know I need to vectorize the tokens, using TF-IDF I think? But that's not working so far. When I try to run this:
model = LogisticRegression().fit(X_train, y_train)
I get this error:
ValueError: could not convert string to float: "['word', 'word', 'word', 'word']
I changed the contents to 'word' because my data is privacy sensitive.
What do I do?
I also want to perform random forest and gradient boosted decision tree on this data. Would there be other things I need to take into account for those algorithms?
Thanks in advance!
Comments (1)
TfidfVectorizer (I assume you mean the one from sklearn) expects raw text strings and tokenizes them itself by default. So if your text is already tokenized, you can't pass it in as-is.
You can apply the TF-IDF formula yourself, which is simple enough, or you can do the following:
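One way to do this, as a minimal sketch rather than the only approach: hand TfidfVectorizer a callable `analyzer` that returns the token list unchanged, so sklearn skips its own preprocessing and tokenization. The toy frame, the `identity` function and the `to_token_list` helper are assumptions made for illustration. The helper is there because the error message in the question suggests the tokens column may hold strings that merely look like lists.

```python
import ast

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real frame (assumption: tokens stored as list-like strings,
# score stored as object/string, as the dtypes in the question suggest).
df = pd.DataFrame({
    "tokens": ["['good', 'service']", "['bad', 'service']",
               "['good', 'food']", "['bad', 'food']"],
    "score": ["1", "0", "1", "0"],
})


def to_token_list(value):
    """Return a real list of tokens, parsing the string form "['a', 'b']" if needed."""
    return ast.literal_eval(value) if isinstance(value, str) else list(value)


def identity(tokens):
    """Analyzer that keeps the spaCy tokens exactly as they are."""
    return tokens


token_lists = df["tokens"].apply(to_token_list)
y = df["score"].astype(int)  # 1-D integer target instead of an object column

# A callable analyzer bypasses sklearn's own preprocessing and tokenization.
vectorizer = TfidfVectorizer(analyzer=identity)
X = vectorizer.fit_transform(token_lists)  # sparse matrix, one row per document

model = LogisticRegression().fit(X, y)
predictions = model.predict(X)
```

Only the tokens column is vectorized here. On real data the vectorizer should be fitted on the training split only and then applied to the test split with `vectorizer.transform`.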
NB: I assume that the tokens you get from spacy are different from those you'd get with sklearn's default tokenizer. Otherwise, just skip the spacy tokenization and pass your raw texts directly to TfidfVectorizer.
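As for the random forest and gradient boosted trees mentioned in the question, the same TF-IDF features can be reused. The sketch below is an illustration on toy data, not a tuned setup. It assumes sklearn's `RandomForestClassifier` and `GradientBoostingClassifier`, both of which accept the sparse matrix that TfidfVectorizer produces. Wrapping the vectorizer and the classifier in a pipeline means the vocabulary is learned from the training split only. The hyperparameters and the `identity` analyzer are placeholders.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def identity(tokens):
    """Analyzer that keeps pre-tokenized documents unchanged."""
    return tokens


# Toy data (assumption): token lists already parsed, labels already integers.
df = pd.DataFrame({
    "tokens": [["good", "service"], ["bad", "service"],
               ["good", "food"], ["bad", "food"]] * 5,
    "score": [1, 0, 1, 0] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["tokens"], df["score"], train_size=0.8, random_state=0,
    stratify=df["score"],
)

models = {
    "logreg": LogisticRegression(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "gbdt": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, clf in models.items():
    # The vectorizer is fitted inside the pipeline, on the training split only.
    pipe = make_pipeline(TfidfVectorizer(analyzer=identity), clf)
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # accuracy on the held-out split
```

Tree ensembles do not need feature scaling. With a large vocabulary they can be slow, though, so capping the vocabulary (for example with the `max_features` or `min_df` parameters of TfidfVectorizer) may be worth trying.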