How can I check whether a word is an English word in Python?


I want to check in a Python program if a word is in the English dictionary.

I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.

def is_english_word(word):
    pass  # how do I implement is_english_word?

is_english_word(token.lower())

In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> English word). How would I achieve that?

Comments (12)

友欢 2024-10-01 11:56:47


For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.

There appears to be a pluralisation library called inflect, but I've no idea whether it's any good.
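
Plugged into the question's function, a minimal sketch might look like this (assuming the en_US dictionary is available; the Dict object is built once and reused, since constructing it on every call would be wasteful):

import enchant

english_dict = enchant.Dict("en_US")  # build once, reuse for every lookup

def is_english_word(word):
    return english_dict.check(word)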

池予 2024-10-01 11:56:47


WordNet won't work well here, because it does not contain all English words.
Another NLTK-based possibility that doesn't need enchant is NLTK's words corpus:

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
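
One caveat: every membership test against words.words() above rescans a plain Python list, so repeated lookups are slow. If you'll be testing many tokens, build a set once; a minimal sketch:

from nltk.corpus import words

english_words = set(words.words())  # built once; membership tests are then O(1)

def is_english_word(word):
    return word in english_words
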
∞觅青森が 2024-10-01 11:56:47


Using NLTK:

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")

You should refer to this article if you have trouble installing wordnet or want to try other approaches.
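
Wrapped into the question's function, a minimal sketch (you may need to run nltk.download('wordnet') first):

from nltk.corpus import wordnet

def is_english_word(word):
    # A non-empty synset list means WordNet has an entry for this word.
    return bool(wordnet.synsets(word))

Bear in mind that WordNet mostly covers content words, so common function words such as "would" or "you" will come back as non-English, as the previous answer notes.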

红衣飘飘貌似仙 2024-10-01 11:56:47


Using a set to store the word list because looking them up will be faster:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be True if you have a good english_words.txt

To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I'd just include the plurals in the word list to begin with.

As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt (you could Google for British or American English if you specifically want one of those dialects).
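
For the question's second part (properties -> property), one hedged option is NLTK's WordNet lemmatizer rather than hand-written plural rules; is_english_word_or_plural is just an illustrative name, and english_words is the set built above:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # requires nltk.download('wordnet')

def is_english_word_or_plural(word):
    word = word.lower()
    # pos="n" asks for the noun lemma, so plural nouns collapse to their
    # singular form, e.g. "properties" -> "property".
    singular = lemmatizer.lemmatize(word, pos="n")
    return word in english_words or singular in english_words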

无边思念无边月 2024-10-01 11:56:47


For All Linux/Unix Users

If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files, which contain all of the words in that specific language. You can access this from any programming language, which is why I thought you might want to know about it.

Now, for Python users specifically, the code below builds a list named words containing every entry in that file:

import re

# Read the system dictionary and split on any non-word characters.
with open("/usr/share/dict/words") as word_file:
    words = re.sub(r"[^\w]", " ", word_file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts")            ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False

Hope this helps!

EDIT:
If you can't find the words file or something similar, see the comment from Dr Phil below.

十雾 2024-10-01 11:56:47


I found 3 package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on win64 with py3, and wordnet doesn't work very well because its corpus isn't complete. So I chose the solution answered by @Sadik and used set(words.words()) to speed it up.

First, install NLTK from a shell and download the word list inside the Python interpreter:

pip3 install nltk
python3

>>> import nltk
>>> nltk.download('words')

Then:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
True

岁月苍老的讽刺 2024-10-01 11:56:47


For a faster NLTK-based solution you could hash the set of words to avoid a linear search.

from nltk.corpus import words as nltk_words

# Build this dictionary once, at module level: you only need to do it once,
# and hashed key lookups then beat a linear scan of the word list.
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    try:
        dictionary[word]
        return True
    except KeyError:
        return False
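
Since dict.fromkeys only uses the keys here, the try/except is equivalent to the more idiomatic return word in dictionary, and a plain set(nltk_words.words()) would do the same job.
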
月亮坠入山谷 2024-10-01 11:56:47


With the SpellChecker from PyEnchant's enchant.checker module:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    # Iterating over the checker yields one error per misspelled word.
    errors = [err.word for err in d]
    # Heuristic: call it English if there are at least 3 words
    # and at most 4 spelling errors.
    return len(errors) <= 4 and len(quote.split()) >= 3

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True
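
Note that the cut-offs used above (more than 4 spelling errors, or fewer than 3 words) are heuristics specific to this snippet rather than anything from the library; tune them to the length and noisiness of the text you expect.
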
清醇 2024-10-01 11:56:47


None of the above libraries contains all English words, so I imported a CSV file containing all English words from this link:
https://github.com/dwyl/english-words

I simply loaded it into a pandas DataFrame and compared against it.
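
A hedged sketch of that approach, assuming words_alpha.txt (one word per line) has been downloaded from that repository; the final membership test uses a set, which is much faster than scanning a DataFrame column:

import pandas as pd

# keep_default_na=False stops pandas turning entries such as "null" or "nan"
# into missing values.
df = pd.read_csv("words_alpha.txt", header=None, names=["word"],
                 keep_default_na=False)
english_words = set(df["word"].str.lower())

def is_english_word(word):
    return word.lower() in english_words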

∞觅青森が 2024-10-01 11:56:47


For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request and return the results in JSON format, then parse them with Python's json module. If it's not an English word, you'll get no results.

As another idea, you could query Wiktionary's API.
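
The Wiktionary idea could look roughly like this using the MediaWiki query API (a hedged sketch: exists_on_wiktionary is a name invented here, and page existence is only a rough proxy for "is an English word", since Wiktionary also covers other languages):

import json
import urllib.parse
import urllib.request

def exists_on_wiktionary(word):
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": word,
        "format": "json",
    })
    url = "https://en.wiktionary.org/w/api.php?" + params
    with urllib.request.urlopen(url) as response:
        pages = json.load(response)["query"]["pages"]
    # The MediaWiki API reports a missing page under the pseudo page id "-1".
    return "-1" not in pages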

明媚如初 2024-10-01 11:56:47


Use nltk.corpus instead of enchant. Enchant gives ambiguous results: for example, it returns true for both benchmark and bench-mark, when it presumably should return false for bench-mark.

和影子一齐双人舞 2024-10-01 11:56:47


Download this txt file: https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt

Then create a Set out of it using the following python code snippet, which loads about 370k purely alphabetic English words:

>>> with open("/PATH/TO/words_alpha.txt") as f:
...     words = set(f.read().split('\n'))
>>> len(words)
370106

From here onwards, you can check for existence in constant time using

>>> word_to_check = 'baboon'
>>> word_to_check in words
True

Note that this set might not be comprehensive, but it still gets the job done; you should do quality checks to make sure it works for your use case as well.
