WordNet synset offset? How to compare words
I am using the Chinese WordNet from Academia Sinica, which is a translation of WordNet 1.6. Unfortunately it is not freely available and has to be purchased, and the manual basically says to refer to WordNet's manual. What I am trying to figure out is how to compare the similarity between two words. I imagine it is done with the WordNetSynsetOffset, but I could not find anything on the WordNet website, or any documentation, on how to use it to compare two words. As for the actual algorithms, I suppose this is a good start: http://marimba.d.umn.edu/similarity/measures.html
<Record Conut="65">
<EnglishLemma>exercise</EnglishLemma>
<POS>Noun</POS>
<WordNetSynsetOffset Version="1.6">00469856</WordNetSynsetOffset>
<EnglishFrequancyRank>通用詞彙</EnglishFrequancyRank>
<ChineseTransList>
<ChineseTrans>
<ChineseLemma>例題</ChineseLemma>
<ChineseFrequancyRank>通用詞彙</ChineseFrequancyRank>
</ChineseTrans>
</ChineseTransList>
</Record>
Comments (2)
So I think what you are looking for (based on the comments) is the WordNet API.
If the Chinese format is the same, you might be able to use the WordNet API that shipped with your installation. It's a C library; you can find the documentation here:
http://wordnet.princeton.edu/wordnet/documentation/
Basically, here's how it works. A synset is a group of synonymous terms, uniquely identified by its synset ID (the 00469856 in your record). Synsets are connected to other synsets through various semantic relations. Most similarity metrics work by looking up one synset (by the number you referenced; the API should support this) and then measuring how far away another synset is.
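To make the "how far away" idea concrete, here is a minimal Python sketch of a path-based measure, where similarity is 1 / (1 + shortest path length) over hypernym links. The graph, the offsets in it, and the function names are all made up for illustration; real links would have to come from the WordNet data files or an API.

```python
from collections import deque

# Hypothetical hypernym ("is-a") links keyed by (pos, offset).
# These offsets are invented for illustration, not real WordNet data.
HYPERNYMS = {
    ("n", "00000003"): [("n", "00000002")],
    ("n", "00000004"): [("n", "00000002")],
    ("n", "00000002"): [("n", "00000001")],
    ("n", "00000005"): [("n", "00000001")],
    ("n", "00000001"): [],
}

def neighbors(graph, node):
    """Treat hypernym links as undirected edges (parents plus children)."""
    up = graph.get(node, [])
    down = [child for child, parents in graph.items() if node in parents]
    return up + down

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in neighbors(graph, node):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def path_similarity(graph, a, b):
    """1 / (1 + path length): identical synsets score 1.0, distant ones less."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1.0 / (1.0 + dist)

if __name__ == "__main__":
    print(path_similarity(HYPERNYMS, ("n", "00000003"), ("n", "00000004")))
```

Keying synsets by (POS, offset) rather than by offset alone is a deliberate choice here, since each part of speech has its own data file.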
A synset also contains a textual description of its meaning: the standard dictionary definition we are used to. Some similarity metrics (such as the Lesk algorithm) use this textual description to compare how "similar" two synsets are to each other.
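As a rough illustration of the gloss-based idea, the sketch below counts the content words two definitions share. It is a simplified overlap count in the spirit of Lesk, not the full algorithm, and the glosses and stopword list are invented for the example rather than taken from WordNet.

```python
import re

# A tiny, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "or", "and", "into", "for", "by"}

def content_words(gloss):
    """Lowercase the gloss and keep alphabetic tokens that are not stopwords."""
    return set(re.findall(r"[a-z]+", gloss.lower())) - STOPWORDS

def gloss_overlap(gloss_a, gloss_b):
    """Number of distinct content words the two glosses have in common."""
    return len(content_words(gloss_a) & content_words(gloss_b))

if __name__ == "__main__":
    # Hypothetical glosses, written for this example only.
    g1 = "a task performed to develop a skill"
    g2 = "an activity performed to improve a skill"
    g3 = "a large body of water"
    print(gloss_overlap(g1, g2), gloss_overlap(g1, g3))
```

For Chinese glosses the tokenizer would need replacing with a proper word segmenter; the regular expression above only handles Latin letters.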
There are other APIs available that allow you to search and access WordNet from various programming languages.
http://wordnet.princeton.edu/wordnet/related-projects/
For instance, here is an example Synset definition from the WordNet 3.0 dictionary files:
00020671 29 v 04 hypnotize 0 hypnotise 0 mesmerize 0 mesmerise 0 (... more left out)...
The unique identifier 00020671 identifies this synset. There are four synonyms here for hypnotize.
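If you end up reading the dictionary files directly, the leading fields of such a line can be split as sketched below in Python. The field names follow my understanding of WordNet's database file format (synset_offset, lex_filenum, ss_type, a two-digit hexadecimal w_cnt, then word/lex_id pairs); the function name is hypothetical, and the pointer and gloss fields that follow the words are not handled. As far as I know the offset is the byte position of the record in the data file for that part of speech, so offsets from 1.6 and 3.0 are not interchangeable.

```python
def parse_synset_prefix(line):
    """Parse the leading fields of a WordNet data-file line.

    Only the offset, lexicographer file number, synset type and word list
    are read; anything after the words is ignored.
    """
    fields = line.split()
    offset, lex_filenum, ss_type = fields[0], fields[1], fields[2]
    word_count = int(fields[3], 16)  # w_cnt is written in hexadecimal
    words = []
    position = 4
    for _ in range(word_count):
        word = fields[position]
        lex_id = int(fields[position + 1], 16)  # lex_id is hexadecimal too
        words.append((word, lex_id))
        position += 2
    return {
        "offset": offset,  # kept as a string to preserve leading zeros
        "lex_filenum": lex_filenum,
        "ss_type": ss_type,
        "words": words,
    }

if __name__ == "__main__":
    # The truncated example line from above, without the omitted remainder.
    sample = "00020671 29 v 04 hypnotize 0 hypnotise 0 mesmerize 0 mesmerise 0"
    print(parse_synset_prefix(sample))
```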
A word could have many possible senses (synsets). If you want to compare similarity between two senses, you'll first have to disambiguate each word. Once you know which two senses you're comparing, you can use what @bwalenz has suggested.
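To see which senses a word has before disambiguating, you could index the XML records by lemma. The Python sketch below assumes records shaped like the one in the question; the `<Records>` wrapper element and the second record (offset 00000000) are hypothetical and exist only so the example has two senses to list.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# First record is the one from the question. The wrapper element and the
# second record are invented for illustration.
SAMPLE = """<Records>
<Record Conut="65">
<EnglishLemma>exercise</EnglishLemma>
<POS>Noun</POS>
<WordNetSynsetOffset Version="1.6">00469856</WordNetSynsetOffset>
<ChineseTransList>
<ChineseTrans>
<ChineseLemma>例題</ChineseLemma>
</ChineseTrans>
</ChineseTransList>
</Record>
<!-- hypothetical record, not real data -->
<Record Conut="0">
<EnglishLemma>exercise</EnglishLemma>
<POS>Noun</POS>
<WordNetSynsetOffset Version="1.6">00000000</WordNetSynsetOffset>
</Record>
</Records>"""

def index_senses(xml_text):
    """Map each English lemma to the (POS, offset) pairs of its records."""
    senses = defaultdict(list)
    root = ET.fromstring(xml_text)
    for record in root.iter("Record"):
        lemma = record.findtext("EnglishLemma")
        pos = record.findtext("POS")
        offset = record.findtext("WordNetSynsetOffset")
        senses[lemma].append((pos, offset))
    return dict(senses)

if __name__ == "__main__":
    for pos, offset in index_senses(SAMPLE)["exercise"]:
        print(pos, offset)
```

Each (POS, offset) pair is one candidate sense; once you have picked one per word, those are the two synsets to feed into a similarity measure.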