Perl paragraph n-grams

Posted on 2024-09-15 08:21:40

Let's say I have a sentence of text:

$body = 'the quick brown fox jumps over the lazy dog';

and I want to get that sentence into a hash of 'keywords', but I want to allow multi-word keywords; I have the following to get single word keywords:

$words{$_}++ for $body =~ m/(\w+)/g;

After this is complete, I have a hash that looks like the following:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

The next step, so that I can get 2-word keywords, is the following:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But that only gets every "other" pair; looking like this:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the one word offset:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way to do this than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

Comments (5)

逆流佳人身旁 2024-09-22 08:21:40

While the described task might be interesting to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.

鲜肉鲜肉永远不皱 2024-09-22 08:21:40

You can do something a little funky with lookaheads:

If I do:

$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

That expression says to look ahead for two words (and capture them), but consume only one.

I get:

%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }

It seems I can generalize this by putting in a variable for the count. Note that the pattern below captures one word plus $n more, so $n = 4 yields 5-word phrases:

my $n    = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
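
As a rough sketch of how the lookahead idea could cover every phrase length up to a maximum, the snippet below wraps it in a loop. The sub name lookahead_ngrams and its parameters are made up for illustration; it repeats the extra-word group $n - 1 times so that $n is the actual number of words per phrase, adds a \b anchor as a precaution, and assumes words are separated by exactly one space.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count all 1..$max_words word
# phrases using the capture-inside-lookahead trick.
sub lookahead_ngrams {
    my ($text, $max_words) = @_;
    my %count;
    for my $n (1 .. $max_words) {
        my $extra = $n - 1;    # words after the first one
        # Capture $n words inside the lookahead, then consume only one word,
        # so the next match attempt starts at the following word.
        $count{$_}++ for $text =~ /(?=\b(\w+(?: \w+){$extra}))\w+/g;
    }
    return \%count;
}

my $counts = lookahead_ngrams('the quick brown fox jumps over the lazy dog', 3);
printf "%-16s => %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Running one pass per length keeps each regex simple; a single pass with several capture groups would also be possible but is harder to read.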

逆蝶 2024-09-22 08:21:40

I would use look-ahead to collect everything but the first word. That way, the position advances correctly automatically:

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

You could simplify it a bit if you want to stick with a single space instead of \s+ (don't forget to remove the /x modifier if you do that), since you could collect any number of words in $2, instead of using one group per word.
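
The single-space simplification mentioned above might look something like the following sketch. The sub name tail_lookahead_ngrams and the variable $more are illustrative only; it assumes single spaces between words and a phrase length of at least two, and collects every word after the first in $2.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count $n-word phrases, $n >= 2.
# The first word is consumed; the remaining $n - 1 words are captured as one
# group inside the lookahead, so the position only advances past one word.
sub tail_lookahead_ngrams {
    my ($text, $n) = @_;
    die "phrase length must be at least 2\n" if $n < 2;
    my $more = $n - 2;    # words after the second one
    my %count;
    ++$count{"$1 $2"} while $text =~ /(\w+) (?=(\w+(?: \w+){$more}))/g;
    return \%count;
}

my $trigrams = tail_lookahead_ngrams('the quick brown fox jumps over the lazy dog', 3);
print "'$_' => $trigrams->{$_}\n" for sort keys %$trigrams;
```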

夜访吸血鬼 2024-09-22 08:21:40

Is there any particular reason for doing this using regexes alone? The obvious approach to me would be to split the text into an array, then use a pair of nested loops to extract your counts from it. Something along the lines of:

#!/usr/bin/env perl

use strict;
use warnings;

my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;

my @words = split / /, $text;
my %counts;

for my $pos (0 .. $#words) {
  for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
    my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
    $counts{$phrase}++;
  }
} 

use Data::Dumper;
print Dumper(\%counts);

Output:

$VAR1 = {
          'over the lazy' => 1,
          'the' => 2,
          'over' => 1,
          'brown fox jumps' => 1,
          'brown fox' => 1,
          'the lazy dog' => 1,
          'jumps over' => 1,
          'the lazy' => 1,
          'the quick brown' => 1,
          'fox jumps' => 1,
          'over the' => 1,
          'brown' => 1,
          'fox jumps over' => 1,
          'quick brown' => 1,
          'jumps' => 1,
          'lazy' => 1,
          'jumps over the' => 1,
          'lazy dog' => 1,
          'dog' => 1,
          'quick brown fox' => 1,
          'fox' => 1,
          'the quick' => 1,
          'quick' => 1
        };

Edit: Fixed $phrase_len loop to prevent use of negative indexes, which was causing incorrect results, per cjm's comment.
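
For reuse, the same split-and-loop idea could be packaged as a sub. This is only a sketch: the name split_ngrams is invented here, it walks forward from each start position instead of backward from each end position, and it uses split ' ' so that runs of whitespace and leading whitespace are tolerated (the original splits on a single space).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count all phrases of
# 1..$max_words consecutive words.
sub split_ngrams {
    my ($text, $max_words) = @_;
    my @words = split ' ', $text;    # special-case split: any whitespace run
    my %count;
    for my $start (0 .. $#words) {
        for my $len (1 .. $max_words) {
            my $end = $start + $len - 1;
            last if $end > $#words;  # phrase would run past the last word
            $count{ join ' ', @words[$start .. $end] }++;
        }
    }
    return \%count;
}

my $counts = split_ngrams('the quick brown fox jumps over the lazy dog', 3);
print "'$_' => $counts->{$_}\n" for sort keys %$counts;
```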

不美如何 2024-09-22 08:21:40

Use the pos operator

pos SCALAR

Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified).

and the @- special array

@LAST_MATCH_START

@-

$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.

For example, the program below grabs each pair's second word in its own capture and rewinds the match's position so what was the second word will be the next pair's first word:

#! /usr/bin/perl

use warnings;
use strict;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+))/g) {
  ++$words{$1};
  pos($body) = $-[2];
}

for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
  print "'$_' => $words{$_}\n";
}

Output:

'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1
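
The rewind technique extends to longer phrases as well. The sketch below is an assumption-laden variation, not the program above: the sub name rewind_ngrams is made up, it captures the first word in group 1 and rewinds to the end of that word using @+ (the companion of @-, holding match end offsets), and it expects single spaces between words.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count $n-word phrases by
# resetting pos() after every match.
sub rewind_ngrams {
    my ($text, $n) = @_;
    my $more = $n - 1;    # words after the first one
    my %count;
    while ($text =~ /(\w+)((?: \w+){$more})/g) {
        ++$count{"$1$2"};
        # Rewind to just after the first word so the next match starts
        # at the second word of this phrase.
        pos($text) = $+[1];
    }
    return \%count;
}

my $bigrams = rewind_ngrams('the quick brown fox jumps over the lazy dog', 2);
print "'$_' => $bigrams->{$_}\n" for sort keys %$bigrams;
```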