Extracting a set of words with Python/NLTK, then comparing it to a standard English dictionary
I have:
from __future__ import division
import nltk, re, pprint

# Tokenize Finnegans Wake and lowercase every token
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]

# Do the same for the reference English corpus
f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishtext]
which is straight from the NLTK manual. What I want to do next is to compare vocab to an exhaustive set of English words, like the OED, and extract the difference -- the set of Finnegans Wake words that are not, and probably never will be, in the OED. I'm much more of a verbal person than a math-oriented person, so I haven't figured out how to do that yet, and the manual goes into way too much detail about stuff I don't actually want to do. I'm assuming it's just one or two more lines of code, though.
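One readily available stand-in for an exhaustive dictionary, if you don't have a machine-readable OED, is the Unix wordlist that ships as an NLTK corpus. A minimal sketch, assuming the corpus has been fetched once with nltk.download('words'):

from nltk.corpus import words as wordlist  # fetched once via nltk.download('words')

# Lowercased set of a couple hundred thousand standard English words
english_dictionary = set(w.lower() for w in wordlist.words())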
Comments (1)
If your English dictionary is indeed a set (hopefully of lowercased words),

newwords = vocab - english_dictionary
gives you the set of words which are in the vocab set but not in the english_dictionary one. (It's a pity that you turned vocab into a list by that sorted, since you need to turn it back into a set to perform operations such as this set difference!)

If your English dictionary is in some different format, not really a set or not comprised only of lowercased words, you'll have to tell us what that format is for us to be able to help!-)
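For example, if it happens to be a plain text file with one word per line (a common wordlist format, and purely an assumption here), building the set takes two lines:

# Assumes a word-per-line file, e.g. /usr/share/dict/words on most Unix systems
with open('/usr/share/dict/words') as fh:
    english_dictionary = set(line.strip().lower() for line in fh)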
Edit: given the OP's edit shows that both words (what was previously called vocab) and englishwords (what I previously called english_dictionary) are in fact lists of lowercased words, then

newwords = set(words) - set(englishwords)

or

newwords = set(words).difference(englishwords)
are two ways to express "the set of words that are not englishwords". The former is slightly more concise, the latter perhaps a bit more readable (since it uses the word "difference" explicitly, instead of a minus sign) and perhaps a bit more efficient (since it doesn't explicitly transform the list englishwords into a set -- though, if speed is crucial this needs to be checked by measurement, since "internally" difference still needs to do some kind of "transformation-to-set"-like operation).
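Such a measurement is easy to run with timeit; a sketch with small made-up stand-in lists (substitute the real words and englishwords in setup):

import timeit

setup = """
words = ['riverrun', 'past', 'eve', 'and', 'adams'] * 1000
englishwords = ['past', 'eve', 'and', 'adams'] * 5000
"""
print(timeit.timeit('set(words) - set(englishwords)', setup=setup, number=100))
print(timeit.timeit('set(words).difference(englishwords)', setup=setup, number=100))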
If you're keen to have a list as the result instead of a set, sorted(newwords) will give you an alphabetically sorted list (list(newwords) would give you a list a bit faster, but in totally arbitrary order, and I suspect you'd rather wait a tiny extra amount of time and get, in return, a nicely alphabetized result;-).
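Putting the pieces together, a minimal end-to-end sketch under the assumptions above (the paths are the OP's; the isalpha() filter is an addition here to drop the punctuation tokens that wordpunct_tokenize emits):

import nltk
from nltk.corpus import words as wordlist  # fetched once via nltk.download('words')

with open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt') as f:
    # Keep only alphabetic tokens, lowercased, as a set
    vocab = set(w.lower() for w in nltk.wordpunct_tokenize(f.read()) if w.isalpha())

english_dictionary = set(w.lower() for w in wordlist.words())

newwords = vocab - english_dictionary
print(len(newwords))          # how many Wake-only coinages survive
print(sorted(newwords)[:20])  # an alphabetized sample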