How can I check whether a word is an English word with Python?
I want to check in a Python program whether a word is in an English dictionary.
I believe the nltk wordnet interface might be the way to go, but I have no idea how to use it for such a simple task.
```python
def is_english_word(word):
    pass  # how do I implement is_english_word?

is_english_word(token.lower())
```
In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> English word). How would I achieve that?
For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:
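A minimal sketch of the basic PyEnchant usage, assuming the en_US dictionary is installed:

```python
import enchant

d = enchant.Dict("en_US")   # load the US English dictionary
print(d.check("Hello"))     # True  - a valid word
print(d.check("Helo"))      # False - not in the dictionary
print(d.suggest("Helo"))    # suggested corrections, e.g. ['Hello', 'Help', ...]
```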
PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.

There appears to be a pluralisation library called inflect, but I've no idea whether it's any good.
It won't work well with WordNet, because WordNet does not contain all English words.
Another possibility based on NLTK without enchant is NLTK's words corpus
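A sketch of the words-corpus approach; the corpus must be downloaded once via nltk.download:

```python
import nltk
nltk.download("words")          # one-time download of the word list

from nltk.corpus import words

word_list = set(words.words())  # a set makes membership checks fast
print("would" in word_list)     # True
print("wuld" in word_list)      # False
```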
Using NLTK:
You should refer to this article if you have trouble installing wordnet or want to try other approaches.
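The usual WordNet-based check looks something like this (a sketch; it assumes the wordnet corpus has been fetched with nltk.download('wordnet')):

```python
from nltk.corpus import wordnet

def is_english_word(word):
    # A word with at least one synset is known to WordNet.
    return len(wordnet.synsets(word)) > 0

print(is_english_word("dog"))    # True
print(is_english_word("asdfg"))  # False
```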
Using a set to store the word list because looking them up will be faster:
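A sketch, assuming a one-word-per-line file named english_words.txt (the filename is illustrative):

```python
# Build the set once; later lookups are O(1) on average.
with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # True, given a reasonable word list
```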
To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I'd just include the plurals in the word list to begin with.
As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.
For All Linux/Unix Users
If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access this from every programming language, which is why I thought you might want to know about it. Now, for Python-specific users, the Python code below should assign the list words to have the value of every single word:
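A minimal sketch; it assumes the words file exists (on some distributions it has to be installed first, e.g. via a wamerican or wbritish package):

```python
# Read the system dictionary into a list, one word per entry.
with open("/usr/share/dict/words") as word_file:
    words = word_file.read().splitlines()

print(len(words))             # typically on the order of 100k words
print("hello" in set(words))  # True; use a set for fast lookups
```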
Hope this helps!
EDIT:
If you can't find the words file or something similar, see the comment from Dr Phil below.
I find that there are 3 package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on win64 with py3. Wordnet doesn't work very well because its corpus isn't complete. So for me, I chose the solution answered by @Sadik, and used set(words.words()) to speed it up.
First:
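A sketch of the one-time setup (assuming Python 3 with pip available):

```python
# pip3 install nltk   (run in a shell first)
import nltk
nltk.download("words")  # fetch the words corpus once
```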
Then:
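And the lookup step, again as a sketch:

```python
from nltk.corpus import words

# Build the set once; membership tests are then O(1) on average.
word_set = set(words.words())
print("word" in word_set)  # True
print("wrod" in word_set)  # False
```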
For a faster NLTK-based solution you could hash the set of words to avoid a linear search.
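For instance (a sketch; the helper name is illustrative):

```python
from nltk.corpus import words as nltk_words

# Hash the vocabulary once, outside any hot loop.
english_vocab = set(w.lower() for w in nltk_words.words())

def is_english_word(word):
    # O(1) average lookup instead of scanning a list.
    return word.lower() in english_vocab
```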
With pyEnchant.checker SpellChecker:
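Something along these lines (a sketch of the checker API; the sample text is made up):

```python
from enchant.checker import SpellChecker

checker = SpellChecker("en_US")
checker.set_text("Is this sentense written in English?")
errors = [err.word for err in checker]  # words the checker flags
print(errors)            # e.g. ['sentense']
print(len(errors) == 0)  # False: the text contains an unknown word
```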
None of the above libraries contains all English words, so I imported a csv file containing all English words from this link: https://github.com/dwyl/english-words and simply made that into a pandas dataframe and compared them.
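A sketch of that idea; the raw-file URL points at the words_alpha.txt list from the linked repo, and keep_default_na guards against words like "null" being parsed as missing values:

```python
import pandas as pd

# Load the one-word-per-line list into a single-column DataFrame.
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
df = pd.read_csv(url, header=None, names=["word"], keep_default_na=False)

english_words = set(df["word"].str.lower())
print("hello" in english_words)  # True
```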
For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request, return the results in JSON format, and parse them with Python's json module. If it's not an English word you'll get no results.

As another idea, you could query Wiktionary's API.
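A sketch of the Wiktionary idea using the standard MediaWiki query API (the endpoint and response shape are my assumptions; verify against the live API):

```python
import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    # Missing pages come back with a "missing" flag under a negative page id.
    url = ("https://en.wiktionary.org/w/api.php"
           "?action=query&format=json&titles=" + urllib.parse.quote(word))
    req = urllib.request.Request(url, headers={"User-Agent": "word-check-example"})
    with urllib.request.urlopen(req) as resp:
        pages = json.load(resp)["query"]["pages"]
    return not any("missing" in page for page in pages.values())

print(in_wiktionary("hello"))  # True, assuming the entry exists
```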
Use nltk.corpus instead of enchant. Enchant gives ambiguous results. For example, for benchmark and bench-mark, enchant is returning true, when it should return false for bench-mark.
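A sketch illustrating the reported difference (enchant's acceptance of the hyphenated form is taken from this answer, not verified here):

```python
import enchant
from nltk.corpus import words

d = enchant.Dict("en_US")
print(d.check("benchmark"))   # True
print(d.check("bench-mark"))  # reportedly also True

vocab = set(words.words())
print("benchmark" in vocab)   # likely True
print("bench-mark" in vocab)  # False: no hyphenated entry in the corpus
```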
Download this txt file https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt then create a set out of it using the following Python code snippet, which loads about 370k alphabetic English words. From here onwards, you can check for existence in constant time.
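A sketch, assuming the file was saved as words_alpha.txt in the working directory:

```python
def load_words():
    # One word per line; a set gives O(1) average membership checks.
    with open("words_alpha.txt") as word_file:
        return set(word_file.read().split())

english_words = load_words()
print("fate" in english_words)   # True
print("fatee" in english_words)  # False
```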
Note that this set might not be comprehensive but still gets the job done; you should do quality checks to make sure it works for your use case as well.