Using Python/NLTK to extract a set of words and then compare it to a standard English dictionary

Posted on 2024-09-13 10:02:38

I have:

from __future__ import division
import nltk, re, pprint
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]

f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishtext]

which is straight from the NLTK manual. What I want to do next is compare vocab to an exhaustive set of English words, like the OED, and extract the difference: the set of Finnegans Wake words that are not, and probably never will be, in the OED. I'm much more of a verbal person than a math-oriented one, so I haven't figured out how to do that yet, and the manual goes into far too much detail about things I don't actually want to do. I'm assuming it's just one or two more lines of code, though.

沉睡月亮 2024-09-20 10:02:38

If your English dictionary is indeed a set (hopefully of lowercased words),

set(vocab) - english_dictionary

gives you the set of words which are in the vocab set but not in the english_dictionary one. (It's a pity that you turned vocab into a list by that sorted, since you need to turn it back into a set to perform operations such as this set difference!).

If your English dictionary is in some different format, not really a set or not comprised only of lowercased words, you'll have to tell us what that format is for us to be able to help!-)
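As a rough sketch of one such case, assuming the dictionary is a plain text file with one word per line (the question does not say, so this is only an assumption), it could be normalized into a set of lowercased words as below. The helper name `load_wordlist` and the sample data are hypothetical, and an in-memory `io.StringIO` stands in for the real file.

```python
import io


def load_wordlist(lines):
    """Build a set of lowercased words from an iterable of lines.

    Hypothetical helper: assumes one word per line and skips blank lines.
    """
    return {line.strip().lower() for line in lines if line.strip()}


# Stand-in for an open dictionary file; a real one would come from open(path).
sample_file = io.StringIO("River\nrun\n\nEve\nAdam\n")
english_dictionary = load_wordlist(sample_file)

# Toy vocabulary, already lowercased as in the question.
vocab = ["riverrun", "eve", "adam", "eve"]

# Words in vocab that the dictionary does not contain.
unknown = set(vocab) - english_dictionary
print(unknown)
```

Lowercasing both sides matters here: without it, "Eve" in the dictionary would not match "eve" in the vocabulary.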

Edit: given the OP's edit shows that both words (what was previously called vocab) and englishwords (what I previously called english_dictionary) are in fact lists of lowercased words, then

newwords = set(words) - set(englishwords)

or

newwords = set(words).difference(englishwords)

are two ways to express "the set of words that are not englishwords". The former is slightly more concise, the latter perhaps a bit more readable (since it uses the word "difference" explicitly, instead of a minus sign) and perhaps a bit more efficient (since it doesn't explicitly transform the list englishwords into a set -- though, if speed is crucial this needs to be checked by measurement, since "internally" difference still needs to do some kind of "transformation-to-set"-like operation).
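A small sketch of the comparison, using made-up toy lists rather than the real texts: it checks that both forms give the same set and times them with the standard `timeit` module. The timings depend on the machine and on input size, so nothing here should be read as a benchmark result.

```python
import timeit

# Toy stand-ins for the lowercased word lists from the question.
words = ["riverrun", "past", "eve", "and", "adam", "past"]
englishwords = ["past", "eve", "and", "adam", "the"]

# The minus operator needs a set on both sides.
by_operator = set(words) - set(englishwords)

# The difference method accepts any iterable, so the list can be passed directly.
by_method = set(words).difference(englishwords)

assert by_operator == by_method

# Rough timing of each form; results vary by machine and data.
t_operator = timeit.timeit(lambda: set(words) - set(englishwords), number=10000)
t_method = timeit.timeit(lambda: set(words).difference(englishwords), number=10000)
print(by_operator, t_operator, t_method)
```

Note that `set(words) - englishwords` with a plain list on the right raises a `TypeError`, which is why the operator form needs the explicit `set(...)` call.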

If you're keen to have a list as the result instead of a set, sorted(newwords) will give you an alphabetically sorted list (list(newwords) would give you a list a bit faster, but in totally arbitrary order, and I suspect you'd rather wait a tiny extra amount of time and get, in return, a nicely alphabetized result;-).
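For illustration, with a made-up three-word set standing in for the real result:

```python
# Toy stand-in for the set difference computed earlier.
newwords = {"riverrun", "commodius", "vicus"}

# sorted() accepts any iterable and always returns a new, ordered list.
alphabetical = sorted(newwords)

# list() keeps whatever order the set happens to iterate in.
arbitrary = list(newwords)

print(alphabetical)
```

Both lists contain the same words; only `alphabetical` has an order that can be relied on.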
