Using Python/NLTK to extract a set of words and then compare it to a standard English dictionary

Posted on 2024-09-13 10:02:38

I have:

from __future__ import division
import nltk, re, pprint
# Tokenize Finnegans Wake and lowercase every token
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]

# Do the same for the reference English corpus
f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishtext]

which is straight from the NLTK manual. What I want to do next is to compare vocab to an exhaustive set of English words, like the OED, and extract the difference -- the set of Finnegans Wake words that have not, and probably never will, be in the OED. I'm much more of a verbal person than a math-oriented person, so I haven't figured out how to do that yet, and the manual goes into way too much detail about stuff I don't actually want to do. I'm assuming it's just one or two more lines of code, though.


Comments (1)

沉睡月亮 2024-09-20 10:02:38

If your English dictionary is indeed a set (hopefully of lowercased words),

set(vocab) - english_dictionary

gives you the set of words which are in the vocab set but not in the english_dictionary one. (It's a pity that you turned vocab into a list by that sorted, since you need to turn it back into a set to perform operations such as this set difference!).

If your English dictionary is in some different format, not really a set or not comprised only of lowercased words, you'll have to tell us what that format is for us to be able to help!-)
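As one illustration of the "different format" case, here is a minimal sketch that assumes the dictionary is a plain word list with one word per line. That format, the helper name load_dictionary, and the sample data are all assumptions made for the example, not something stated in the question.

```python
def load_dictionary(lines):
    """Build a set of lowercased words from an iterable of lines (one word per line)."""
    # Strip whitespace, lowercase, and skip blank lines
    return {line.strip().lower() for line in lines if line.strip()}

# Illustrative in-memory data; with a real word list you could pass an open file object
sample_lines = ["Apple\n", "river\n", "\n", "RUN\n"]
english_dictionary = load_dictionary(sample_lines)

vocab = ["riverrun", "river", "apple"]
print(set(vocab) - english_dictionary)  # {'riverrun'}
```

Because an open text file iterates line by line, the same helper should work on a file handle without reading the whole file into a list first.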

Edit: given the OP's edit shows that both words (what was previously called vocab) and englishwords (what I previously called english_dictionary) are in fact lists of lowercased words, then

newwords = set(words) - set(englishwords)

or

newwords = set(words).difference(englishwords)

are two ways to express "the set of words that are not englishwords". The former is slightly more concise, the latter perhaps a bit more readable (since it uses the word "difference" explicitly, instead of a minus sign) and perhaps a bit more efficient (since it doesn't explicitly transform the list englishwords into a set -- though, if speed is crucial this needs to be checked by measurement, since "internally" difference still needs to do some kind of "transformation-to-set"-like operation).
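A small sketch with toy lists (illustrative data, not the real corpora) shows that the two spellings give the same result:

```python
# Toy stand-ins for the real token lists
words = ["riverrun", "past", "eve", "and", "adam", "past"]
englishwords = ["past", "eve", "and", "adam", "the"]

# Operator form: both operands must be sets
by_operator = set(words) - set(englishwords)

# Method form: the argument may be any iterable, such as a list
by_method = set(words).difference(englishwords)

assert by_operator == by_method
print(by_operator)  # {'riverrun'}
```

Note that the operator form needs a set on both sides: set(words) - englishwords raises a TypeError when englishwords is a list, whereas difference accepts the list directly.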

If you're keen to have a list as the result instead of a set, sorted(newwords) will give you an alphabetically sorted list (list(newwords) would give you a list a bit faster, but in totally arbitrary order, and I suspect you'd rather wait a tiny extra amount of time and get, in return, a nicely alphabetized result;-).
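Here is a short sketch of the sorted step, using a made-up newwords set. Since wordpunct_tokenize also emits punctuation and digit tokens, the last line shows an optional str.isalpha filter; that filter is an extra suggestion, not part of the answer above.

```python
# Hypothetical result of the set difference, including non-word tokens
newwords = {"riverrun", "commodius", "vicus", "!", "1132"}

# Alphabetical list; punctuation and digits sort ahead of letters
alphabetical = sorted(newwords)
print(alphabetical)  # ['!', '1132', 'commodius', 'riverrun', 'vicus']

# Optional: keep only purely alphabetic tokens
only_words = sorted(w for w in newwords if w.isalpha())
print(only_words)  # ['commodius', 'riverrun', 'vicus']
```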
