Given a document collection in Text::DocumentCollection in Perl, I want to calculate the cosine similarity between any two documents in the collection using Text::Document.
I think this can probably be done using EnumerateV and callbacks, but I'm having trouble figuring out the specifics. (This SO question is helpful, but I'm still stuck.)
To be specific, suppose the collection is stored in test.db as follows:
#!/usr/bin/perl -w
use strict;
use Text::DocumentCollection;
use Text::Document;
# Collection backed by the persistent file test.db
my $c = Text::DocumentCollection->new( file => 'test.db' );
my $text = 'Stack Overflow is a programming | Q & A site that’s free. Free to ask | questions, free to answer questions|, free to read, free to index';
# Each '|'-separated chunk becomes one document, keyed 1, 2, 3, ...
my @strings = split /\|/, $text;
my $i = 0;
foreach (@strings) {
    my $doc = Text::Document->new();
    $doc->AddContent($_);
    $c->Add(++$i, $doc);
}
Now suppose I need to read in test.db and calculate cosine similarity for all combinations of documents. (I don't have access to the documents created in the code above other than through the stored database file.)
I think the answer is in constructing a subroutine that is accessed with the callback in EnumerateV, and I'm guessing that the subroutine also calls EnumerateV, but I haven't been able to figure it out.
You might want to start with something like this:
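What follows is a minimal sketch of one possible starting point, not a definitive implementation. It assumes that `Text::DocumentCollection->NewFromDB( file => ... )` reloads the collection written by the script in the question, and that `EnumerateV` passes each stored `Text::Document` object to the callback somewhere in its argument list. Because the exact callback signature is an assumption here, the callback finds the document by class rather than by position. The helper names `collect_documents` and `pairwise_similarity` are hypothetical. Instead of nesting one `EnumerateV` call inside another, the sketch enumerates once, keeps the documents in an array, and then loops over all unordered pairs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use Text::DocumentCollection;
use Text::Document;

# Gather [key, document] pairs from the collection via EnumerateV.
# The callback's exact argument list is an assumption, so the document
# is located by class instead of by position.
sub collect_documents {
    my ($collection) = @_;
    my @found;
    $collection->EnumerateV(sub {
        my @args = @_;
        for my $i (0 .. $#args) {
            next unless blessed($args[$i]) && $args[$i]->isa('Text::Document');
            # Treat a plain scalar just before the document as its key;
            # otherwise fall back to a running counter.
            my $prev = $i > 0 ? $args[$i - 1] : undef;
            my $key = (defined $prev && !ref $prev) ? $prev : scalar(@found) + 1;
            push @found, [ $key, $args[$i] ];
        }
        return;    # contribute nothing to EnumerateV's own result list
    });
    return @found;
}

# Compute the cosine similarity for every unordered pair of documents.
# Takes a list of [key, document] pairs, returns [key_a, key_b, score] rows.
sub pairwise_similarity {
    my @docs = @_;
    my @rows;
    for my $i (0 .. $#docs - 1) {
        for my $j ($i + 1 .. $#docs) {
            my ($key_a, $doc_a) = @{ $docs[$i] };
            my ($key_b, $doc_b) = @{ $docs[$j] };
            push @rows, [ $key_a, $key_b, $doc_a->CosineSimilarity($doc_b) ];
        }
    }
    return @rows;
}

# Run only when executed as a script, not when loaded from another file.
unless (caller) {
    # Assumption: NewFromDB reopens the collection stored in test.db.
    my $c = Text::DocumentCollection->NewFromDB( file => 'test.db' )
        or die "could not load collection from test.db\n";

    # Enumeration order is not guaranteed, so sort by key for stable output.
    my @docs = sort { $a->[0] cmp $b->[0] } collect_documents($c);

    printf "%s vs %s: %.4f\n", @$_ for pairwise_similarity(@docs);
}
```

Collecting the documents first keeps the pairing logic independent of how `EnumerateV` orders or invokes its callback. If term weighting is wanted, `WeightedCosineSimilarity` in Text::Document could be substituted in `pairwise_similarity`, but its weight-function arguments would need to be checked against the module documentation.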