WordNet - Word synonyms and related word constructs - Java or Python

Posted 2024-11-28 14:50:23

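I am looking to use WordNet to find a collection of similar terms starting from a base set of terms.

For example, for the word "discouraged", potential synonyms could be: daunted, glum, deterred, pessimistic.

I also want to identify potential bigrams, such as: beat down, put off, caved in, etc.

How do I go about extracting this information using Java or Python? Are there any hosted WordNet databases or web interfaces that allow such queries?

Thanks!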

梦里人 2024-12-05 14:50:23

It is easiest to understand the WordNet data by looking
at the Prolog files. They are documented here:

http://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html

WordNet terms are grouped into synsets. A synset is a maximal
synonym set. Synsets have a primary key so that they can be used
in semantic relationships.

So answering your first question, you can list the different
senses and corresponding synonyms of a word as follows:

Input X: Term
Output Y: Sense  
Output L: Synonyms in this Sense  

s_helper(X,Y) :- s(X,_,Y,_,_,_).  
?- setof(H,(s_helper(Y,X),s_helper(Y,H)),L).

Example:

?- setof(H,(s_helper(Y,'discouraged'),s_helper(Y,H)),L).  
Y = 301664880,  
L = [demoralised, demoralized, discouraged, disheartened] ;  
Y = 301992418,  
L = [discouraged] ;  
No

For the second part of your question, WordNet terms are
sequences of words. So you can search the WordNet terms
for a given word as follows:

Input X: Word  
Output Y: Term

s_helper(X) :- s(_,_,X,_,_,_).  
word_in_term(X,Y) :- atom_concat(X,' ',H), sub_atom(Y,0,_,_,H).
word_in_term(X,Y) :- atom_concat(' ',X,H), atom_concat(H,' ',J), sub_atom(Y,_,_,_,J).
word_in_term(X,Y) :- atom_concat(' ',X,H), sub_atom(Y,_,_,0,H).
?- s_helper(Y), word_in_term(X,Y).

Example:

?- s_helper(X), word_in_term('beat',X).  
X = 'beat generation' ;  
X = 'beat in' ;  
X = 'beat about' ;  
X = 'beat around the bush' ;  
X = 'beat out' ;  
X = 'beat up' ;  
X = 'beat up' ;  
X = 'beat back' ;  
X = 'beat out' ;  
X = 'beat down' ;  
X = 'beat a retreat' ;  
X = 'beat down' ;  
X = 'beat down' ;  
No

This would give you potential n-grams, but not much
morphological variation. WordNet also records some
lexical relations, which could be useful.

Neither of the Prolog queries I have given is very efficient, though.
The problem is the lack of word indexing. A Java
implementation could of course do something better.
Just imagine something along these lines:

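import java.util.Hashtable;
import java.util.Vector;
class Synset {
    // Lookup by synset id (the primary key).
    static Hashtable<Integer,Synset> synset_access = new Hashtable<>();
    // Lookup by term: every synset (sense) the term belongs to.
    static Hashtable<String,Vector<Synset>> term_access = new Hashtable<>();
}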

Some Prolog systems can do the same: with an indexing directive, it is
possible to instruct the Prolog system to index on multiple
arguments of a predicate.
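
To make the indexing idea concrete, here is a minimal Python sketch of the same lookup tables, built over a few toy rows instead of the real WordNet files. The two "discouraged" synsets are copied from the example output above; the ids for the "beat ..." terms and the extra `word_access` table are assumptions added for illustration only.

```python
from collections import defaultdict

# Toy rows shaped like the (synset id, term) pairs of the s/6 facts.
# The 'discouraged' rows come from the example output above; the ids of
# the 'beat ...' rows are made up for illustration.
ROWS = [
    (301664880, "demoralised"),
    (301664880, "demoralized"),
    (301664880, "discouraged"),
    (301664880, "disheartened"),
    (301992418, "discouraged"),
    (900000001, "beat up"),
    (900000002, "beat down"),
    (900000003, "beat around the bush"),
]

synset_access = defaultdict(set)  # synset id -> terms in that synset
term_access = defaultdict(set)    # term -> ids of the synsets containing it
word_access = defaultdict(set)    # single word -> terms that contain it

for synset_id, term in ROWS:
    synset_access[synset_id].add(term)
    term_access[term].add(synset_id)
    for word in term.split():
        word_access[word].add(term)


def synonyms_by_sense(term):
    """Return {synset id: sorted terms} for every sense of `term`."""
    senses = sorted(term_access.get(term, ()))
    return {sid: sorted(synset_access[sid]) for sid in senses}


def terms_containing(word):
    """Return the sorted multi-word terms containing `word` as a whole word."""
    return sorted(t for t in word_access.get(word, ()) if " " in t)
```

With these tables, `synonyms_by_sense("discouraged")` and `terms_containing("beat")` each cost a dictionary lookup instead of a scan over all facts, which is the point of the index.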

Putting up a web service shouldn't be that difficult either,
in Java or in Prolog. Many Prolog systems make it easy to embed
Prolog programs in web servers, and Java supports servlets.

A list of Prolog systems that support web servers can be found here:

http://en.wikipedia.org/wiki/Comparison_of_Prolog_implementations#Operating_system_and_Web-related_features

Best Regards


赏烟花じ飞满天 2024-12-05 14:50:23

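These are two different questions.

1) WordNet and Python. Use NLTK; it has a nice interface to WordNet. You could write something on your own, but honestly, why make life difficult? LingPipe probably has something built in too, but NLTK is much easier to use. I think NLTK just downloads an NLTK database, but I'm pretty sure there are APIs to talk to WordNet.

2) To get bigrams in NLTK, follow this tutorial. In general, you tokenize the text and then iterate over each sentence, collecting all the n-grams for each word by looking forward and backward.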
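
As a rough sketch of both points, assuming NLTK is installed and its WordNet corpus has been fetched once with `nltk.download('wordnet')`: the helper names below are made up for illustration, and `str.split` stands in for a real tokenizer. NLTK also ships its own `nltk.bigrams` and `nltk.ngrams` helpers; the plain-Python version here just shows the idea.

```python
def lemmas_by_synset(synsets):
    """Map each synset's name to its lemma names (the synonyms in that sense)."""
    return {s.name(): list(s.lemma_names()) for s in synsets}


def wordnet_synonyms(word):
    """Look `word` up through NLTK's WordNet interface.

    Needs nltk plus its 'wordnet' corpus, so the import is done lazily.
    """
    from nltk.corpus import wordnet as wn
    return lemmas_by_synset(wn.synsets(word))


def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigrams_with(tokens, word):
    """Bigrams containing `word`: one token back and one token forward."""
    return [bg for bg in ngrams(tokens, 2) if word in bg]
```

For example, `bigrams_with("they beat down the door".split(), "beat")` yields the pairs around "beat". Note that NLTK writes multi-word lemma names with underscores (`beat_up` rather than `beat up`).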


昇り龍 2024-12-05 14:50:23

As an alternative to NLTK, you can use one of the available WordNet SPARQL endpoints to retrieve such information. Example query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wordnet: <http://www.w3.org/2006/03/wn/wn20/schema/>

SELECT DISTINCT ?label {
  ?input_word a wordnet:WordSense;
     rdfs:label ?input_label.
  FILTER (?input_label = 'run')
  ?synset wordnet:containsWordSense ?input_word.
  ?synset wordnet:containsWordSense ?synonym.
  ?synonym rdfs:label ?label.
} LIMIT 100

In the Java world, the Jena and Sesame frameworks can be used.
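
If the query is issued from Python instead, a small helper can fill in the search word. This is only a sketch: the function name is hypothetical, it only escapes backslashes and single quotes, and no endpoint URL is assumed here since none is given above.

```python
QUERY_TEMPLATE = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wordnet: <http://www.w3.org/2006/03/wn/wn20/schema/>

SELECT DISTINCT ?label {{
  ?input_word a wordnet:WordSense;
     rdfs:label ?input_label.
  FILTER (?input_label = '{word}')
  ?synset wordnet:containsWordSense ?input_word.
  ?synset wordnet:containsWordSense ?synonym.
  ?synonym rdfs:label ?label.
}} LIMIT {limit}
"""


def build_synonym_query(word, limit=100):
    """Fill the query above for `word`, escaping backslashes and single quotes."""
    escaped = word.replace("\\", "\\\\").replace("'", "\\'")
    return QUERY_TEMPLATE.format(word=escaped, limit=int(limit))
```

The resulting string can then be sent to whichever endpoint is chosen, using any HTTP client or SPARQL library.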
