Unicode error in the tokenization step, only when doing stop word removal in Python 2
I am trying to run this script: enter link description here. The only difference is that instead of TEST_SENTENCES I need to read my own dataset (the text column), and I need to apply stop word removal to that column before passing it to the rest of the code.
df = pd.DataFrame({'text': ['the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
'The wireless internet was unreliable. ', 'i am still her . :). ',
'I appreciate your help ', 'I appreciate your help '], 'sentiment':[
'positive', 'negative', 'neutral', 'positive', 'neutral']})
The error is not raised when I use the data frame defined this way, but it is raised when I use a CSV file that contains exactly the same data.
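(As a side note, one way to check whether the hand-built DataFrame and the CSV really hold the same kind of strings is to look at the element types; in Python 2 a column can hold either str (bytes) or unicode. A small diagnostic sketch, where the file name data.csv is made up:
import pandas as pd

df_csv = pd.read_csv('data.csv')                 # hypothetical file name
# Python 2: shows whether the column holds <type 'str'> (bytes) or <type 'unicode'>
print(df['text'].map(type).value_counts())
print(df_csv['text'].map(type).value_counts())
)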
But when I add this line of code to remove stop words:
df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1') for word in x.split() if word not in (stop)]))
TEST_SENTENCES = df['text_without_stopwords']
It keeps raising this error: ValueError: All sentences should be Unicode-encoded!
The error is raised in the tokenization step:
tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)
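(For reference, one way to see what the tokenizer is complaining about is to check the type of each sentence right before the call and convert any plain byte strings to unicode. A Python 2 sketch, assuming the raw bytes are UTF-8:
# Report any sentences that are byte strings (str) rather than unicode.
for i, s in enumerate(TEST_SENTENCES):
    if not isinstance(s, unicode):
        print('%d %s %r' % (i, type(s), s))

# Defensive conversion before tokenizing (assumes UTF-8 bytes).
TEST_SENTENCES = [s.decode('utf-8') if isinstance(s, str) else s for s in TEST_SENTENCES]
tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)
)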
I want to know what is happening here that causes this error, and the correct way to fix the code.
(I have tried different encodings such as utf-8, etc., but none worked.)
Comments (1)
I don't know the reason yet, but when I did this it worked.
Still very curious to know why this happens only when I do stop word removal.
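For readers who hit the same error: one hedged guess at the kind of change that makes this work (an assumption, not necessarily the change actually made above) is to make sure the column is unicode before the join, so the stop-word-filtered sentences stay unicode in Python 2. A minimal sketch, assuming the stop list comes from NLTK and using a made-up file name data.csv:
# -*- coding: utf-8 -*-
# Hypothetical sketch, not necessarily the answerer's actual fix.
import pandas as pd
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

df = pd.read_csv('data.csv', encoding='utf-8')   # hypothetical file name
# Python 2: decode any remaining byte strings so every sentence is unicode.
df['text'] = df['text'].apply(lambda s: s.decode('utf-8') if isinstance(s, str) else s)
df['text_without_stopwords'] = df['text'].apply(
    lambda x: u' '.join(w for w in x.split() if w not in stop))
TEST_SENTENCES = df['text_without_stopwords'].tolist()
The point is that u' '.join(...) over unicode words returns unicode, while joining plain str pieces returns bytes again, which is presumably what tokenize_sentences rejects.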