Unicode error in the tokenization step only when performing stop-word removal in Python 2



I am trying to run this script: enter link description here
(The only difference is that instead of the hard-coded TEST_SENTENCES I need to read my own dataset (its text column), and I need to apply stop-word removal to that column before passing it to the rest of the code.)

import pandas as pd

df = pd.DataFrame({'text': ['the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
                            'The wireless internet was unreliable. ', 'i am still her . :). ',
                            'I appreciate your help ', 'I appreciate your help '],
                   'sentiment': ['positive', 'negative', 'neutral', 'positive', 'neutral']})

The error is not raised when I use the data frame built this way, but it is raised when I use a CSV file that contains the exact same data.

But when I add this line of code to remove stop words

# 'stop' is the stop-word collection defined earlier (e.g. NLTK English stop words)
df['text_without_stopwords'] = df['text'].apply(
    lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1')
                        for word in x.split() if word not in stop]))
TEST_SENTENCES = df['text_without_stopwords']

it keeps raising this error:
ValueError: All sentences should be Unicode-encoded!

Also, the error is raised in the tokenization step:

tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)

I want to know what is happening here to cause this error, and the correct way to fix the code.

(I have tried different encodings like utf-8, etc., but none worked.)
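For context, a minimal sketch of the most likely mechanism, assuming Python 2 semantics and a tokenizer that accepts only unicode objects (the error message matches the isinstance check that DeepMoji-style sentence tokenizers perform; that is an assumption, since the linked script is not shown). On Python 2, pandas.read_csv returns text columns as byte strings (str), which is why the CSV path behaves differently from the inline data frame. The in-memory CSV below is illustrative:

# -*- coding: utf-8 -*-
# Python 2 sketch: CSV text comes back as bytes (str), not unicode.
from io import BytesIO
import pandas as pd

csv_data = BytesIO(b'text\nThe wireless internet was unreliable.\n')
df = pd.read_csv(csv_data)

s = df['text'].iloc[0]
print(type(s))                    # <type 'str'> -- bytes on Python 2

# Joining byte-string words keeps the result a byte string:
print(type(' '.join(s.split()))) # <type 'str'>

# Decoding the column up front yields unicode objects, which a check
# like isinstance(sentence, unicode) will accept:
df['text'] = df['text'].str.decode('utf-8')
print(type(df['text'].iloc[0]))  # <type 'unicode'>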


1 Answer

菊凝晚露 2025-01-19 01:16:09


I don't know the reason yet, but when I did

df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')

it worked.

I'm still very curious to know why this happens only when I do stop-word removal.
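A plausible explanation, offered as a guess rather than a confirmed diagnosis: in Python 2, ' '.join(words) is promoted to unicode only when the list actually contains unicode items. After stop-word removal, a row whose words are all stop words produces an empty list, and joining an empty list returns the byte string '', so the column ends up with mixed str and unicode elements (a CSV can also contribute NaN for blank cells). astype('unicode') coerces every element to a unicode object, which is why the tokenizer's type check then passes. A minimal sketch:

# -*- coding: utf-8 -*-
# Python 2 sketch of the suspected mixed-type column and the astype fix.
import pandas as pd

print(type(' '.join([u'appreciate', u'help'])))  # <type 'unicode'> -- join promotes
print(type(' '.join([])))                        # <type 'str'> -- empty join stays bytes

# e.g. the second row lost every word to stop-word removal:
col = pd.Series([u'appreciate help', ' '.join([])])
print([type(x) for x in col])   # mixed: [<type 'unicode'>, <type 'str'>]

col = col.astype('unicode')     # coerce every element to unicode
print([type(x) for x in col])   # uniform: [<type 'unicode'>, <type 'unicode'>]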
