Wordnet synsets using Perl
I installed WordNet::Similarity and WordNet::QueryData (http://search.cpan.org/dist/WordNet-QueryData/QueryData.pm) as an easy way to calculate the information content scores and probabilities that come with these modules. But I'm stuck on this basic problem: given a word, print n words similar to it. That should be no harder than iterating through the synsets and doing a join.
Using the wn command and piping its output through a whole lot of tr, plus sort | uniq, I can get all the words:
wn cat -synsn | grep -v Sense | tr '=' ' ' | tr '>' ' ' | tr '\t' ' ' | tr ',' '\n' | sort | uniq
OUTPUT
8 senses of cat
adult female
adult male
African tea
Arabian tea
big cat
bozo
cat
cat
CAT
Caterpillar
cat-o'-nine-tails
computed axial tomography
computed tomography
computerized axial tomography
computerized tomography
CT
excitant
felid
feline
gossip
gossiper
gossipmonger
guy
hombre
kat
khat
man
newsmonger
qat
quat
rumormonger
rumourmonger
stimulant
stimulant drug
Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun cat
tracked vehicle
true cat
whip
woman
X-radiation
X-raying
But it's kind of nasty and needs further cleanup.
My script is below. What I want to get is all the words in cat#n#1 through cat#n#8.
SCRIPT
use WordNet::QueryData;
my $wn = WordNet::QueryData->new( noload => 1);
print "Senses: ", join(", ", $wn->querySense("cat#n")), "\n";
print "Synset: ", join(", ", $wn->querySense("cat", "syns")), "\n";
print "Hyponyms: ", join(", ", $wn->querySense("cat#n#1", "hypo")), "\n";
OUTPUT:
Senses: cat#n#1, cat#n#2, cat#n#3, cat#n#4, cat#n#5, cat#n#6, cat#n#7, cat#n#8
Synset: cat#n, cat#v
Hyponyms: domestic_cat#n#1, wildcat#n#3
SCRIPT
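use WordNet::QueryData;
my $wn = WordNet::QueryData->new;
foreach my $word (qw/cat#n/) {
    # Every sense of the word: cat#n#1 ... cat#n#8
    my @senses = $wn->querySense($word);
    foreach my $wps (@senses) {
        # "syns" returns the members of this sense's synset
        my @synset = $wn->querySense($wps, "syns");
        print "$wps : @synset\n";
    }
}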
OUTPUT:
cat#n#1 : cat#n#1 true_cat#n#1
cat#n#2 : guy#n#1 cat#n#2 hombre#n#1 bozo#n#2
cat#n#3 : cat#n#3
cat#n#4 : kat#n#1 khat#n#1 qat#n#1 quat#n#1 cat#n#4 Arabian_tea#n#1 African_tea#n#1
cat#n#5 : cat-o'-nine-tails#n#1 cat#n#5
cat#n#6 : Caterpillar#n#2 cat#n#6
cat#n#7 : big_cat#n#1 cat#n#7
cat#n#8 : computerized_tomography#n#1 computed_tomography#n#1 CT#n#2 computerized_axial_tomography#n#1 computed_axial_tomography#n#1 CAT#n#8
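The loop above already returns every synset member, so a plain word list only needs the #pos#sense suffix stripped and the duplicates dropped. A minimal sketch, using only the querySense calls shown above; the helper names plain_word and unique_words and the command-line handling are illustrative assumptions, not part of WordNet::QueryData:

```perl
use strict;
use warnings;

# Turn "big_cat#n#1" into "big cat": drop the #pos#sense suffix
# and replace underscores with spaces.
sub plain_word {
    my ($wps) = @_;
    (my $word = $wps) =~ s/#.*\z//s;
    $word =~ tr/_/ /;
    return $word;
}

# Distinct plain words from a list of word#pos#sense strings,
# in first-seen order. Comparison is case-sensitive.
sub unique_words {
    my %seen;
    return grep { !$seen{$_}++ } map { plain_word($_) } @_;
}

# Usage: perl synwords.pl 'cat#n'
if (@ARGV) {
    require WordNet::QueryData;    # loaded only when actually querying
    my $wn = WordNet::QueryData->new;
    # All senses of the word, then all members of each sense's synset.
    my @members = map { $wn->querySense($_, "syns") } $wn->querySense($ARGV[0]);
    print "$_\n" for unique_words(@members);
}
```

Because the comparison is case-sensitive, cat and CAT stay separate entries. Unlike the wn -synsn pipeline, this lists synset members only, not hypernyms such as feline or man.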
P.S.
I had never written Perl before today, but I have been reading Perl scripts since this morning and can now understand the basics. I just need to know whether there is a cleaner way to do this with the API. I couldn't figure it out from the API docs or the user group archives.
Update:
I think I'll settle for:
wn cat -synsn | sed '1,6d' | sed 's/Sense [[:digit:]]//g' | sed 's/[[:space:]]*=> //' | sed '/^$/d'
sed rocks!
Comments (2)
I think you'll find the following helpful...
http://marimba.d.umn.edu/WordNet-Pairs/
What are the N most similar words to X, according to WordNet?
This data seeks to answer that question, where similarity is based on
measures from WordNet::Similarity. http://wn-similarity.sourceforge.net
-------------- verb data
These files were created with WordNet::Similarity version 2.05 using
WordNet 3.0. They show all the pairwise verb-verb similarities found
in WordNet according to the path, wup, lch, lin, res, and jcn measures.
The path, wup, and lch are path-based, while res, lin, and jcn are based
on information content.
As of March 15, 2011 pairwise measures for all verbs using the six
measures above are available, each in their own .tar file. Each *.tar
file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx
2.0 - 2.4 GB compressed. In each of these .tar files you will find
25,047 files, one for each verb sense. Each file consists of 25,048 lines,
where each line (except the first) contains a WordNet verb sense and the
similarity to the sense featured in that particular file. Doing
the math here, you find that each .tar file contains about 625,000,000
pairwise similarity values. Note that these are symmetric (sim (A,B)
= sim (B,A)) so you have a bit more than 300 million unique values.
-------------- noun data
As of August 19, 2011 pairwise measures for all nouns using the path
measure are available. This file is named WordNet-noun-noun-path-pairs.tar.
It is approximately 120 GB compressed. In this file you will find
146,312 files, one for each noun sense. Each file consists of
146,313 lines, where each line (except the first) contains a WordNet
noun sense and the similarity to the sense featured in that particular
file. Doing the math here, you find that each .tar file contains
about 21,000,000,000 pairwise similarity values. Note that these
are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion
unique values.
We are currently running wup, res, and lesk, but do not have an
estimated date of availability yet.
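Once one of these per-sense files is unpacked, picking the N most similar senses is a sort on the score column. A rough sketch under a stated assumption: the description above only says that every line after the first holds a sense and its similarity, so the whitespace-separated "sense score" layout used here is a guess, and top_n_similar is a hypothetical helper name:

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Return the $n highest-scoring [sense, score] pairs read from $fh.
# ASSUMPTION: line 1 is a header, and every later line is
# "<sense> <score>" separated by whitespace.
sub top_n_similar {
    my ($fh, $n) = @_;
    my @pairs;
    my $header = <$fh>;                       # skip the first line
    while (my $line = <$fh>) {
        my ($sense, $score) = split ' ', $line;
        next unless defined $score && looks_like_number($score);
        push @pairs, [$sense, $score];
    }
    my @sorted = sort { $b->[1] <=> $a->[1] } @pairs;
    $n = @sorted if $n > @sorted;             # fewer rows than requested
    return @sorted[0 .. $n - 1];
}

# Usage: perl top_n.pl FILE [N]
if (@ARGV) {
    my ($file, $n) = @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    printf "%s\t%s\n", @$_ for top_n_similar($fh, $n // 10);
}
```

If a file lists its own sense among the rows, that sense will probably come out on top, so asking for N + 1 rows and dropping the first may be what you want.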
Put this in a script, say synonym.sh
From your perl script
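One way the call from Perl might look, as a minimal sketch: it assumes synonym.sh sits in the current directory, is executable, and takes the word as its first argument (the pipeline in the question hardcodes cat, so that part is a guess). The run_lines helper is a hypothetical name:

```perl
use strict;
use warnings;

# Run a command and return its output lines without trailing newlines.
# The list form of open passes the arguments straight to the program,
# so the word is never interpreted by a shell.
sub run_lines {
    my @cmd = @_;
    open my $pipe, '-|', @cmd or die "Cannot run $cmd[0]: $!";
    chomp(my @lines = <$pipe>);
    close $pipe or die "$cmd[0] exited with status $?";
    return @lines;
}

# Usage: perl synonyms.pl cat
if (@ARGV) {
    print "$_\n" for run_lines('./synonym.sh', $ARGV[0]);
}
```

Backticks would also work for a quick script, but the list form sidesteps quoting problems with words like cat-o'-nine-tails.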