Unicode error in the tokenization step only when performing stop-word removal in Python 2



I am trying to run this script: enter link description here
(The only difference is that instead of the hard-coded TEST_SENTENCES I need to read my own dataset (its text column), and I need to apply stop-word removal to that column before passing it to the rest of the code.)

import pandas as pd

df = pd.DataFrame({'text': ['the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
                            'The wireless internet was unreliable. ', 'i am still her . :). ',
                            'I appreciate your help ', 'I appreciate your help '],
                   'sentiment': ['positive', 'negative', 'neutral', 'positive', 'neutral']})

The error is not raised when I use the data frame built this way, but it is raised when I use a CSV file that contains the exact same data.

But when I add this line of code to remove stop words

# 'stop' is the stop-word collection defined earlier (e.g. NLTK English stop words)
df['text_without_stopwords'] = df['text'].apply(
    lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1')
                        for word in x.split() if word not in stop]))
TEST_SENTENCES = df['text_without_stopwords']

it keeps raising this error:
ValueError: All sentences should be Unicode-encoded!

Also, the error is raised in the tokenization step:

tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)

I want to know what is happening here to cause this error, and the correct way to fix the code.

(I have tried different encodings like utf-8, etc., but none worked.)
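For context, a minimal sketch of the most likely mechanism, assuming Python 2 semantics and a tokenizer that accepts only unicode objects (the error message matches the isinstance check that DeepMoji-style sentence tokenizers perform; that is an assumption, since the linked script is not shown). On Python 2, pandas.read_csv returns text columns as byte strings (str), which is why the CSV path behaves differently from the inline data frame. The in-memory CSV below is illustrative:

# -*- coding: utf-8 -*-
# Python 2 sketch: CSV text comes back as bytes (str), not unicode.
from io import BytesIO
import pandas as pd

csv_data = BytesIO(b'text\nThe wireless internet was unreliable.\n')
df = pd.read_csv(csv_data)

s = df['text'].iloc[0]
print(type(s))                    # <type 'str'> -- bytes on Python 2

# Joining byte-string words keeps the result a byte string:
print(type(' '.join(s.split()))) # <type 'str'>

# Decoding the column up front yields unicode objects, which a check
# like isinstance(sentence, unicode) will accept:
df['text'] = df['text'].str.decode('utf-8')
print(type(df['text'].iloc[0]))  # <type 'unicode'>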


1 Answer

菊凝晚露 2025-01-19 01:16:09


I don't know the reason yet, but when I did

df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')

it worked.

I'm still very curious to know why this happens only when I do stop-word removal.
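A plausible explanation, offered as a guess rather than a confirmed diagnosis: in Python 2, ' '.join(words) is promoted to unicode only when the list actually contains unicode items. After stop-word removal, a row whose words are all stop words produces an empty list, and joining an empty list returns the byte string '', so the column ends up with mixed str and unicode elements (a CSV can also contribute NaN for blank cells). astype('unicode') coerces every element to a unicode object, which is why the tokenizer's type check then passes. A minimal sketch:

# -*- coding: utf-8 -*-
# Python 2 sketch of the suspected mixed-type column and the astype fix.
import pandas as pd

print(type(' '.join([u'appreciate', u'help'])))  # <type 'unicode'> -- join promotes
print(type(' '.join([])))                        # <type 'str'> -- empty join stays bytes

# e.g. the second row lost every word to stop-word removal:
col = pd.Series([u'appreciate help', ' '.join([])])
print([type(x) for x in col])   # mixed: [<type 'unicode'>, <type 'str'>]

col = col.astype('unicode')     # coerce every element to unicode
print([type(x) for x in col])   # uniform: [<type 'unicode'>, <type 'unicode'>]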
