How to get synonyms ordered by their probability of occurrence from WordNet
I am searching WordNet for synonyms of a large list of words. With my current setup, when a word has more than one synonym, the results are returned in alphabetical order. What I need is to have them ordered by their probability of occurrence, and I would take just the top synonym.
I used the Prolog WordNet database and Syns2Index to convert it into a Lucene index for querying synonyms. Is there a way to get the synonyms ordered by probability with this setup, or should I use another approach?
Speed is not important; the synonym lookup will not be done online.
Comments (2)
In case someone stumbles upon this thread, this was the way to go (at least for what I needed):
http://lyle.smu.edu/~tspell/jaws/doc/edu/smu/tspell/wordnet/impl/file/ReferenceSynset.html#getTagCount%28java.lang.String%29
The getTagCount method lets you pick the most likely synset for every word. The remaining problem is that the synset with the highest probability can itself contain several words, but I guess there is no way to avoid this.
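One possible way to narrow a multi-word synset down to a single synonym is sketched below in plain Java. It does not call JAWS: `Sense` is a hypothetical stand-in for a synset, its counts are assumed to come from something like `getTagCount`, and the numbers used with it are invented. The tie-break (take the other word form with the highest tag count inside the winning synset) is my own assumption, not something the library prescribes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a synset: each word form mapped to its tag count. */
final class Sense {
    final Map<String, Integer> tagCounts = new LinkedHashMap<>();

    /** Adds a word form with its tag count and returns this sense for chaining. */
    Sense add(String wordForm, int tagCount) {
        tagCounts.put(wordForm, tagCount);
        return this;
    }

    /** Tag count of the word form in this sense, or 0 if it is not a member. */
    int tagCount(String wordForm) {
        Integer count = tagCounts.get(wordForm);
        return count == null ? 0 : count;
    }
}

final class TopSynonym {
    /** Returns the single best synonym for the word, or null if there is none. */
    static String pick(String word, List<Sense> senses) {
        // Step 1: the most likely sense is the one where the word has the highest tag count.
        Sense best = null;
        for (Sense sense : senses) {
            if (!sense.tagCounts.containsKey(word)) {
                continue;
            }
            if (best == null || sense.tagCount(word) > best.tagCount(word)) {
                best = sense;
            }
        }
        if (best == null) {
            return null;
        }
        // Step 2 (assumed tie-break): among the other members, keep the most frequent one.
        String synonym = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : best.tagCounts.entrySet()) {
            if (entry.getKey().equals(word)) {
                continue;
            }
            if (entry.getValue() > bestCount) {
                synonym = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return synonym;
    }
}
```

For example, with one sense holding "car" (71), "auto" (2) and "automobile" (10) and another holding "car" (2) and "railcar" (1), `TopSynonym.pick("car", senses)` selects the first sense and then "automobile" from it.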
I think you should add one more step (given that speed is not important).
From the Lucene index, build another dictionary in which each word maps to a small object holding the single synonym whose meaning has the highest probability of occurrence, together with that meaning and its probability. That is, given a small class with those three fields, you just have to fill the dictionary from the Lucene index.
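A minimal sketch of such a dictionary, in Java, might look like this. The class and method names (`SynonymEntry`, `BestSynonymDictionary`, `offer`, `lookup`) are hypothetical, and the code does not read Lucene itself: it assumes you iterate over the index yourself and pass in each candidate along with whatever probability estimate you have for it.

```java
import java.util.HashMap;
import java.util.Map;

/** The small object stored per word: best synonym, its meaning, and its probability. */
final class SynonymEntry {
    final String synonym;
    final String meaning;
    final double probability;

    SynonymEntry(String synonym, String meaning, double probability) {
        this.synonym = synonym;
        this.meaning = meaning;
        this.probability = probability;
    }
}

/** Maps each word to the one candidate synonym with the highest probability seen so far. */
final class BestSynonymDictionary {
    private final Map<String, SynonymEntry> best = new HashMap<>();

    /** Offers a candidate; it replaces the stored entry only if it is strictly more probable. */
    void offer(String word, String synonym, String meaning, double probability) {
        SynonymEntry current = best.get(word);
        if (current == null || probability > current.probability) {
            best.put(word, new SynonymEntry(synonym, meaning, probability));
        }
    }

    /** Returns the stored entry for the word, or null if no candidate was offered. */
    SynonymEntry lookup(String word) {
        return best.get(word);
    }
}
```

You would call `offer` once per (word, synonym) pair while walking the index, then use `lookup(word).synonym` at query time. Since the lookup is offline, the extra pass over the index costs nothing that matters here.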