WordNet - word synonyms and related word constructs - Java or Python
I would like to use WordNet to find a collection of similar terms from a base set of terms.
For example, for the word 'discouraged', potential synonyms could be: daunted, glum, deterred, pessimistic.
I would also like to identify potential bi-grams such as: beat down, put off, caved in, etc.
How do I go about extracting this information using Java or Python? Are there any hosted WordNet databases/web interfaces which would allow such querying?
Thanks!
Comments (3)
It is easiest to understand the WordNet data by looking
at the Prolog files. They are documented here:
http://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html
WordNet terms are grouped into synsets. A synset is a maximal
synonym set. Synsets have a primary key so that they can be used
in semantic relationships.
So, to answer your first question, you can list the different
senses and corresponding synonyms of a word as follows:
Example:
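A minimal Python sketch of this lookup, assuming the s(synset_id, w_num, word, ss_type, sense_number, tag_count) fact layout described in the prologdb documentation. The facts, the ids, and the helper name `senses_with_synonyms` are made up for illustration; the function scans every fact, much like an unindexed query would.

```python
# Each tuple mirrors one s/6 fact:
# (synset_id, w_num, word, ss_type, sense_number, tag_count).
# The ids and counts are invented for illustration, not real WordNet data.
FACTS = [
    (300001, 1, "discouraged", "s", 1, 0),
    (300001, 2, "demoralized", "s", 1, 0),
    (300001, 3, "disheartened", "s", 1, 0),
    (300002, 1, "deterred", "s", 1, 0),
    (300002, 2, "discouraged", "s", 2, 0),
    (300003, 1, "glum", "s", 1, 0),
]


def senses_with_synonyms(word, facts=FACTS):
    """Map each synset id that contains `word` to the other words in that synset."""
    # First pass: every synset the word belongs to (one per sense).
    synset_ids = [fact[0] for fact in facts if fact[2] == word]
    # Second pass: the remaining members of each of those synsets.
    return {
        sid: [fact[2] for fact in facts if fact[0] == sid and fact[2] != word]
        for sid in synset_ids
    }


for synset_id, synonyms in senses_with_synonyms("discouraged").items():
    print(synset_id, synonyms)
```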
For the second part of your question, WordNet terms can be
sequences of words, so you can search the WordNet terms
for a given word as follows:
Example:
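A rough Python sketch of such a search, assuming the terms have already been read into a list of strings. The sample terms and the helper name `multiword_terms_containing` are hypothetical, and the split accepts both spaces and underscores because different WordNet distributions separate the words of a term differently.

```python
import re

# Invented sample of WordNet-style terms, not real WordNet data.
TERMS = ["put off", "put_down", "beat down", "cave in", "discouraged", "input"]


def multiword_terms_containing(word, terms=TERMS):
    """Return the multi-word terms in which `word` occurs as a whole word."""
    hits = []
    for term in terms:
        parts = re.split(r"[ _]+", term)
        # Keep only terms made of several words, matching on whole words
        # so that "put" does not match "input".
        if len(parts) > 1 and word in parts:
            hits.append(term)
    return hits


print(multiword_terms_containing("put"))
```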
This would give you potential n-grams, but not much
morphological variation. WordNet also encodes some
lexical relations, which could be useful.
However, neither of the Prolog queries I have given is very efficient.
The problem is the lack of word indexing. A Java
implementation could of course do something better.
Just imagine something along these lines:
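One way such an index could look, sketched here in Python rather than Java. It assumes input lines in the s/6 fact layout of the Prolog files; the sample lines, their ids, and the class name `WordIndex` are invented for illustration. Two dictionaries replace the full scan: one from word to synset ids, one from synset id to words.

```python
import re
from collections import defaultdict

# A few lines in the s/6 fact layout; ids and counts are invented.
SAMPLE = """\
s(300001,1,'discouraged',s,1,0).
s(300001,2,'demoralized',s,1,0).
s(300002,1,'put off',v,1,0).
s(300002,2,'discourage',v,2,0).
"""

# Captures the synset id and the quoted word of one fact.
FACT = re.compile(r"s\((\d+),\d+,'(.*)',\w+,\d+,\d+\)\.")


class WordIndex:
    """Indexes facts by word and by synset id so lookups avoid a full scan."""

    def __init__(self, lines):
        self.by_word = defaultdict(set)     # word -> synset ids
        self.by_synset = defaultdict(list)  # synset id -> words
        for line in lines:
            match = FACT.match(line.strip())
            if not match:
                continue
            synset_id = int(match.group(1))
            word = match.group(2).replace("''", "'")  # undo Prolog quote escaping
            self.by_word[word].add(synset_id)
            self.by_synset[synset_id].append(word)

    def synonyms(self, word):
        """All words sharing at least one synset with `word`, sorted."""
        found = set()
        for synset_id in self.by_word.get(word, ()):
            found.update(self.by_synset[synset_id])
        found.discard(word)
        return sorted(found)


index = WordIndex(SAMPLE.splitlines())
print(index.synonyms("discouraged"))
```

Building the index costs one pass over the facts; after that each lookup touches only the synsets of the queried word.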
Some Prolog systems can do the same: with an indexing directive, it is
possible to instruct the Prolog system to index on multiple
arguments of a predicate.
Putting up a web service shouldn't be that difficult either,
in Java or Prolog. Many Prolog systems easily allow embedding
Prolog programs in web servers, and Java has servlets.
A list of Prolog systems that support web servers can be found here:
http://en.wikipedia.org/wiki/Comparison_of_Prolog_implementations#Operating_system_and_Web-related_features
Best Regards
These are two different problems.
1) WordNet and Python. Use NLTK; it has a nice interface to WordNet. You could write something on your own, but honestly, why make life difficult? LingPipe probably also has something built in, but NLTK is much easier to use. I think NLTK just downloads its own copy of the database, but I'm pretty sure there are APIs to talk to WordNet.
2) To get bigrams in NLTK, follow this tutorial. In general, you tokenize the text and then iterate over each sentence, collecting the n-grams for each word by looking forward and backward.
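Both points can be sketched in a few lines of Python. The `ngrams` helper below is plain Python and hypothetical (NLTK ships its own n-gram utilities); `wordnet_synonyms` uses NLTK's `nltk.corpus.wordnet` interface and assumes NLTK is installed and the WordNet corpus has been fetched, for example with `nltk.download('wordnet')`. It is defined but not called here, so the sketch runs without NLTK.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple (bigrams by default)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def wordnet_synonyms(word):
    """Lemma names from every synset of `word`.

    Requires NLTK plus its WordNet corpus. Multi-word lemmas come back
    joined with underscores, e.g. names of the form "put_off".
    """
    # Imported lazily so that ngrams() stays usable without NLTK.
    from nltk.corpus import wordnet as wn
    return sorted({name for synset in wn.synsets(word)
                   for name in synset.lemma_names()})


# Naive whitespace tokenization, just for the illustration.
tokens = "he was put off by the delay".split()
for bigram in ngrams(tokens):
    print(bigram)
```

Candidate bigrams from the text can then be checked against the multi-word lemma names to find phrases such as "put off".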
As an alternative to NLTK, you can use one of the available WordNet SPARQL endpoints to retrieve such information. Query example:
In the Java world, the Jena and Sesame frameworks can be used.