Comparing every document against all the others in Perl's Text::DocumentCollection

Posted 2024-12-20 05:35:06

Given a document collection in Text::DocumentCollection in Perl, I want to calculate the cosine similarity between any two documents in the collection using Text::Document.

I think this can probably be done using EnumerateV and callbacks, but I'm having trouble figuring out the specifics. (This SO question is helpful, but I'm still stuck.)

To be specific, suppose the collection is stored in test.db as follows:

#!/usr/bin/perl
use strict;
use warnings;
use Text::DocumentCollection;
use Text::Document;

# Create a collection backed by the DB file test.db.
my $c = Text::DocumentCollection->new( file => 'test.db' );

my $text = 'Stack Overflow is a programming | Q & A site that’s free. Free to ask | questions, free to answer questions|, free to read, free to index';

# Each '|'-separated fragment becomes one document, keyed 1, 2, 3, ...
my @strings = split /\|/, $text;
my $i = 0;

foreach my $string (@strings) {
    my $doc = Text::Document->new();
    $doc->AddContent($string);
    $c->Add(++$i, $doc);
}

Now suppose I need to read in test.db and calculate cosine similarity for all combinations of documents. (I don't have access to the documents created in the code above other than through the stored database file.)

I think the answer is in constructing a subroutine that is accessed with the callback in EnumerateV, and I'm guessing that the subroutine also calls EnumerateV but I haven't been able to figure it out.


姜生凉生 2024-12-27 05:35:06

You might want to start with something like this:

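my %dist;    # similarity for each pair, stored under both key orders

$c->EnumerateV(sub {
    my ($c, $k1, $d1) = @_;
    $c->EnumerateV(sub {
        my ($c, $k2, $d2) = @_;
        # Join the keys with a separator so that e.g. (1, 12) and (11, 2) cannot collide.
        return if exists $dist{"$k1|$k2"};
        # cosine_dist is a placeholder for your own similarity routine.
        $dist{"$k1|$k2"} = $dist{"$k2|$k1"} = cosine_dist($d1, $d2);
    });
});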
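
To make the nested-callback idea concrete, here is a minimal sketch of a complete script. It rests on a few assumptions that are worth checking against the module documentation: that `Text::DocumentCollection->NewFromDB( file => ... )` reloads a collection saved earlier, that `EnumerateV` calls the callback with the collection, the key, and the document in that order, and that `Text::Document` offers a `CosineSimilarity` method that can stand in for the `cosine_dist` placeholder. The helper name `pairwise_cosine` and the nested-hash result layout are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::DocumentCollection;
use Text::Document;

# Hypothetical helper: returns a hash ref with $sim->{$k1}{$k2} set for
# every unordered pair of distinct keys (stored once, with $k1 lt $k2).
sub pairwise_cosine {
    my ($collection) = @_;
    my %sim;
    $collection->EnumerateV(sub {
        my ($c, $k1, $d1) = @_;
        # The inner enumeration visits every document again for this $d1.
        $c->EnumerateV(sub {
            my (undef, $k2, $d2) = @_;
            # Keep each pair once and skip comparing a document with itself.
            return unless $k1 lt $k2;
            # Assumption: CosineSimilarity is the Text::Document method
            # that plays the role of cosine_dist in the answer above.
            $sim{$k1}{$k2} = $d1->CosineSimilarity($d2);
            return;
        });
        return;
    });
    return \%sim;
}

# Load the stored collection (default file name taken from the question).
my $file = @ARGV ? $ARGV[0] : 'test.db';
if (-e $file) {
    my $c = Text::DocumentCollection->NewFromDB( file => $file )
        or die "Could not load collection from $file\n";
    my $sim = pairwise_cosine($c);
    for my $k1 (sort keys %$sim) {
        for my $k2 (sort keys %{ $sim->{$k1} }) {
            printf "%s vs %s: %.4f\n", $k1, $k2, $sim->{$k1}{$k2};
        }
    }
}
```

Comparing keys with `lt` instead of tracking already-seen pairs avoids building composite hash keys altogether, so the result does not depend on the order in which `EnumerateV` visits the documents. The cost is still quadratic in the number of documents, since every document is visited once per outer iteration.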