How do I add missing words to nltk.corpus.words.words()?
I am trying to remove non-English words from a text. The problem is that many valid English words are absent from the NLTK words corpus.
My code:
import pandas as pd
lst = ['I have equipped my house with a new [xxx] HP203X climatisation unit']
df = pd.DataFrame(lst, columns=['Sentences'])
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())
df['Sentences'] = df['Sentences'].apply(lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in (words)))
df
Input: I have equipped my house with a new [xxx] HP203X climatisation unit
Result: I have my house with a new unit
Should have been: I have equipped my house with a new climatisation unit
I can't figure out how to extend nltk.corpus.words.words() so that words like equipped and climatisation are not removed from the sentences.
Comments (1)
You can use a set method to add the missing words. Here, words is a set, which is why .extend(word_list) did not work.
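As a rough sketch of the point about sets: a Python set has no .extend() method, but set.update() adds every item of an iterable. The example below is an assumption about what such a fix could look like, not the answer's original code. To run without downloading the corpus, it uses a tiny stand-in vocabulary in place of set(nltk.corpus.words.words()) and a regular expression with the same token pattern as nltk.wordpunct_tokenize. The names extra_words and keep_known_words are hypothetical.

```python
import re

# Stand-in for set(nltk.corpus.words.words()): a tiny hypothetical vocabulary
# so the sketch runs without downloading the corpus.
words = {"i", "have", "my", "house", "with", "a", "new", "unit"}

# Hypothetical list of extra words that should survive the filter.
extra_words = ["equipped", "climatisation"]

# A set has no .extend(); .update() adds every item of an iterable.
words.update(extra_words)


def keep_known_words(sentence, vocabulary):
    """Keep only the tokens whose lowercase form is in the vocabulary."""
    # Same token pattern as nltk.wordpunct_tokenize: runs of word
    # characters, or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    return " ".join(t for t in tokens if t.lower() in vocabulary)


sentence = "I have equipped my house with a new [xxx] HP203X climatisation unit"
result = keep_known_words(sentence, words)
print(result)
```

With the real corpus, the same words.update(extra_words) call would go right after words = set(nltk.corpus.words.words()) in the question's code, and the rest of that code could stay unchanged.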