Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words?
Which similarity function in
nltk.corpus.wordnet
is appropriate for finding the similarity of two words?
path_similarity()?
lch_similarity()?
wup_similarity()?
res_similarity()?
jcn_similarity()?
lin_similarity()?
I want to use such a function for word clustering, and the Yarowsky algorithm for finding similar collocations in a large text.
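For reference, a minimal sketch of how these functions are invoked (the dog/cat synsets are purely illustrative; res/jcn/lin additionally need an information-content corpus):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Similarity is defined between synsets (word senses), not raw strings,
# so each word has to be mapped to one of its synsets first.
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print(dog.path_similarity(cat))  # shortest-path measure, between 0 and 1
print(dog.wup_similarity(cat))   # Wu-Palmer, based on depth of the common ancestor
print(dog.lch_similarity(cat))   # Leacock-Chodorow, requires the same POS

# res/jcn/lin also need an information-content dictionary, e.g. from the Brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(dog.res_similarity(cat, brown_ic))
print(dog.jcn_similarity(cat, brown_ic))
print(dog.lin_similarity(cat, brown_ic))
```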
Comments (2)
These measures are actually for word senses (or concepts), not words. That distinction might matter. In other words, the word "train" can mean "locomotive" or "being taught to do something". To use these measures you'd need to know which sense was intended.
If you want to do word clustering, these measures might not be exactly what you want...
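To illustrate the point, a word like "train" maps to several synsets, and the score depends entirely on which pair of senses you compare. A rough sketch (the specific sense labels such as train.n.01 are assumptions about how WordNet happens to index these senses):

```python
from nltk.corpus import wordnet as wn

# "train" covers several unrelated senses; each one is a separate synset.
for s in wn.synsets('train'):
    print(s.name(), '-', s.definition())

locomotive_sense = wn.synset('train.n.01')  # the railway sense
teaching_sense = wn.synset('train.v.01')    # the "being taught" sense

# The similarity you get depends on which senses you picked
# (and may be None when the senses share no hypernym path).
print(locomotive_sense.path_similarity(wn.synset('bus.n.01')))
print(teaching_sense.path_similarity(wn.synset('teach.v.01')))
```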
I've been playing with NLTK/wordnet myself for the purposes of trying to match up some texts in some automatic way. As Ted Pedersen's answer notes, it pretty quickly becomes clear that the similarity functions in
nltk.corpus.wordnet
only produce non-zero similarities for quite closely related terms with a solid IS-A pedigree.
What I ended up doing was taking the vocabulary in my texts, and then using lemma->synset->lemmas and lemma->similar_tos to grow my own word linkage graph (
graph_tool
was fantastic for this) and then counting the minimum number of hops needed to link two words, to get some sort of (dis-)similarity measure between them (quite entertaining to print these out; like watching a very bizarre word-association game). This did actually work well enough for my purposes, even without any attempt to take POS/sense into account.
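A rough sketch of that idea, using networkx in place of graph_tool and an illustrative linkage rule (words linked through shared synset lemmas plus similar_tos neighbours); the vocabulary and helper names here are made up for the example, not the original code:

```python
import networkx as nx
from nltk.corpus import wordnet as wn

def build_linkage_graph(vocabulary):
    """Link each word to the other lemmas of its synsets, and to lemmas of related synsets."""
    g = nx.Graph()
    for word in vocabulary:
        for synset in wn.synsets(word):
            # word <-> every other lemma of the same synset
            for other in synset.lemma_names():
                g.add_edge(word, other)
            # word <-> lemmas of "similar to" synsets (mostly useful for adjectives)
            for related in synset.similar_tos():
                for other in related.lemma_names():
                    g.add_edge(word, other)
    return g

def hop_distance(graph, w1, w2):
    """Minimum number of hops between two words; None if they are not connected."""
    if w1 in graph and w2 in graph and nx.has_path(graph, w1, w2):
        return nx.shortest_path_length(graph, w1, w2)
    return None

vocab = ['happy', 'glad', 'cheerful', 'train', 'locomotive']
g = build_linkage_graph(vocab)
print(hop_distance(g, 'happy', 'cheerful'))
```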