WordNet - Word Synonyms and Related Word Constructs - Java or Python

Posted on 2024-11-28 14:50:23

I am looking to use WordNet to look for a collection of like terms from a base set of terms.

For example, the word 'discouraged' - potential synonyms could be: daunted, glum, deterred, pessimistic.

I also want to identify potential bigrams such as: beat down, put off, caved in, etc.

How do I go about extracting this information using Java or Python? Are there any hosted WordNet databases/web interfaces which would allow such querying?

Thanks!

Comments (3)

梦里人 2024-12-05 14:50:23

It is easiest to understand the WordNet data by looking
at the Prolog files. They are documented here:

http://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html

WordNet terms are grouped into synsets. A synset is a maximal
synonym set. Synsets have a primary key so that they can be used
in semantic relationships.

So answering your first question, you can list the different
senses and corresponding synonyms of a word as follows:

Input X: Term
Output Y: Sense  
Output L: Synonyms in this Sense  

s_helper(X,Y) :- s(X,_,Y,_,_,_).  
?- setof(H,(s_helper(Y,X),s_helper(Y,H)),L).  

Example:

?- setof(H,(s_helper(Y,'discouraged'),s_helper(Y,H)),L).  
Y = 301664880,  
L = [demoralised, demoralized, discouraged, disheartened] ;  
Y = 301992418,  
L = [discouraged] ;  
No  

For the second part of your question, WordNet terms are
sequences of words. So you can search these WordNet terms
for words as follows:

Input X: Word  
Output Y: Term

s_helper(X) :- s(_,_,X,_,_,_).  
word_in_term(X,Y) :- atom_concat(X,' ',H), sub_atom(Y,0,_,_,H).
word_in_term(X,Y) :- atom_concat(' ',X,H), atom_concat(H,' ',J), sub_atom(Y,_,_,_,J).
word_in_term(X,Y) :- atom_concat(' ',X,H), sub_atom(Y,_,_,0,H).
?- s_helper(Y), word_in_term(X,Y).

Example:

?- s_helper(X), word_in_term('beat',X).  
X = 'beat generation' ;  
X = 'beat in' ;  
X = 'beat about' ;  
X = 'beat around the bush' ;  
X = 'beat out' ;  
X = 'beat up' ;  
X = 'beat up' ;  
X = 'beat back' ;  
X = 'beat out' ;  
X = 'beat down' ;  
X = 'beat a retreat' ;  
X = 'beat down' ;  
X = 'beat down' ;  
No

This would give you potential n-grams, but not so much
morphological variation. WordNet does also exhibit some
lexical relations, which could be useful.
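
The whole-word match done by word_in_term can also be sketched in Python. This is only an illustration under assumptions: TERMS is a small hand-picked sample (terms copied from the output above, plus two single-word entries added to show that substrings are not matched), not the real WordNet data, and terms_containing is a hypothetical helper name. Unlike the Prolog query, it drops duplicate terms.

```python
# Sample terms: the multi-word entries come from the query output above,
# "upbeat" and "heartbeat" are extra entries added for illustration.
# A real run would load every term from the WordNet files (assumption).
TERMS = [
    "beat generation", "beat in", "beat about", "beat around the bush",
    "beat out", "beat up", "beat back", "beat down", "beat a retreat",
    "upbeat", "heartbeat",
]


def terms_containing(word, terms):
    """Return the multi-word terms that contain `word` as a whole word."""
    matches = set()
    for term in terms:
        tokens = term.split(" ")
        # Like the Prolog clauses, require a space next to the word,
        # so single-word terms never match.
        if len(tokens) > 1 and word in tokens:
            matches.add(term)
    return sorted(matches)


print(terms_containing("beat", TERMS))
```

Splitting on spaces keeps the check simple; it would not catch inflected forms such as "beaten down", which matches the remark about morphological variation.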

But both Prolog queries I have given are not very efficient.
The problem is the lack of some word indexing. A Java
implementation could of course implement something better.
Just imagine something along these lines:

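import java.util.Hashtable;
import java.util.Vector;
class Synset {
    // synset id -> synset
    static Hashtable<Integer,Synset> synset_access;
    // term -> every synset that contains the term
    static Hashtable<String,Vector<Synset>> term_access;
}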

Some Prolog systems can do the same: with an indexing directive, it is
possible to instruct the Prolog system to index on multiple
arguments of a predicate.
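
The two-table index idea can be sketched in Python with dictionaries. This is a minimal sketch, not a full loader: SAMPLE_FACTS reuses the synset ids and words from the example output above, while the other fields of each fact are placeholder values, and build_indexes and synonyms_by_sense are hypothetical names.

```python
import re
from collections import defaultdict

# Illustrative facts in the s/6 layout of the Prolog database. The synset
# ids and words come from the example output above; the remaining fields
# are placeholder values (assumption).
SAMPLE_FACTS = """\
s(301664880,1,'demoralized',s,1,0).
s(301664880,2,'demoralised',s,1,0).
s(301664880,3,'discouraged',s,1,0).
s(301664880,4,'disheartened',s,1,0).
s(301992418,1,'discouraged',s,2,0).
"""

# Captures the synset id (first argument) and the word (third argument).
FACT = re.compile(r"^s\((\d+),\d+,'(.*)',\w+,\d+,\d+\)\.$")


def build_indexes(lines):
    """Build synset-id -> words and word -> synset-ids lookup tables."""
    synset_access = defaultdict(set)
    term_access = defaultdict(set)
    for line in lines:
        match = FACT.match(line.strip())
        if not match:
            continue  # skip anything that is not an s/6 fact
        synset_id = int(match.group(1))
        word = match.group(2).replace("''", "'")  # undo Prolog quote escaping
        synset_access[synset_id].add(word)
        term_access[word].add(synset_id)
    return synset_access, term_access


def synonyms_by_sense(word, synset_access, term_access):
    """Return {synset_id: sorted words} for every sense of `word`."""
    senses = sorted(term_access.get(word, ()))
    return {sid: sorted(synset_access[sid]) for sid in senses}


synset_access, term_access = build_indexes(SAMPLE_FACTS.splitlines())
print(synonyms_by_sense("discouraged", synset_access, term_access))
```

Both lookups are hash-table accesses, so a query no longer has to scan every fact the way the unindexed Prolog queries do.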

Putting up a web service shouldn't be that difficult, either
in Java or Prolog. Many Prolog systems easily allow embedding
Prolog programs in web servers, and Java has servlets.

A list of Prolog systems that support web servers can be found here:

http://en.wikipedia.org/wiki/Comparison_of_Prolog_implementations#Operating_system_and_Web-related_features

Best Regards

赏烟花じ飞满天 2024-12-05 14:50:23

These are two different problems.

1) WordNet and Python. Use NLTK; it has a nice interface to WordNet. You could write something on your own, but honestly, why make life difficult? LingPipe probably also has something built in, but NLTK is much easier to use. I think NLTK just downloads an NLTK database, but I'm pretty sure there are APIs to talk to WordNet.

2) To get bigrams in NLTK, follow this tutorial. In general, you tokenize the text and then iterate over each sentence, collecting the n-grams for each word by looking forward and backward.
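
Both points can be sketched in a few lines of Python. This is a rough illustration, not the tutorial's code: wordnet_synonyms and bigrams are hypothetical helper names, the `lookup` parameter exists only so the function can be tried without NLTK installed, and the real lookup assumes the NLTK WordNet corpus has already been downloaded.

```python
def wordnet_synonyms(word, lookup=None):
    """Group the synonyms of `word` by WordNet sense (synset)."""
    if lookup is None:
        # Imported lazily: this needs NLTK plus its WordNet corpus,
        # which can be fetched once with nltk.download("wordnet").
        from nltk.corpus import wordnet
        lookup = wordnet.synsets
    return {s.name(): sorted(s.lemma_names()) for s in lookup(word)}


def bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


tokens = "he was beat down and put off".split()
print(bigrams(tokens))
```

With NLTK and its corpus available, wordnet_synonyms("discouraged") would return one entry per synset; multi-word lemmas come back joined by underscores. NLTK also ships its own nltk.bigrams helper, so the hand-written version is only there to show the idea.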

昇り龍 2024-12-05 14:50:23

As an alternative to NLTK, you can use one of the available WordNet SPARQL endpoints to retrieve such information. Query example:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wordnet: <http://www.w3.org/2006/03/wn/wn20/schema/>

SELECT DISTINCT ?label {
  ?input_word a wordnet:WordSense;
     rdfs:label ?input_label.
  FILTER (?input_label = 'run')
  ?synset wordnet:containsWordSense ?input_word.
  ?synset wordnet:containsWordSense ?synonym.
  ?synonym rdfs:label ?label.
} LIMIT 100

In the Java universe, the Jena and Sesame frameworks can be used.
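
For Python, a request for the query above could be assembled along these lines. This is a sketch under assumptions: the endpoint URL is a placeholder rather than a real service, build_query and build_request_url are hypothetical names, and the query is sent the way the SPARQL protocol describes for GET requests, in a `query` parameter.

```python
from urllib.parse import urlencode

# The query from the answer, with the word and the limit made configurable.
QUERY_TEMPLATE = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wordnet: <http://www.w3.org/2006/03/wn/wn20/schema/>

SELECT DISTINCT ?label {{
  ?input_word a wordnet:WordSense;
     rdfs:label ?input_label.
  FILTER (?input_label = '{word}')
  ?synset wordnet:containsWordSense ?input_word.
  ?synset wordnet:containsWordSense ?synonym.
  ?synonym rdfs:label ?label.
}} LIMIT {limit}
"""


def build_query(word, limit=100):
    """Fill the template, escaping characters that would end the literal."""
    safe = word.replace("\\", "\\\\").replace("'", "\\'")
    return QUERY_TEMPLATE.format(word=safe, limit=int(limit))


def build_request_url(endpoint, word, limit=100):
    """Return a GET URL carrying the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": build_query(word, limit)})


# Placeholder endpoint, not a real service (assumption).
print(build_request_url("http://example.org/sparql", "run"))
```

The resulting URL can be fetched with any HTTP client; the response format depends on the endpoint and on the Accept header sent with the request.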
