Perl paragraph n-grams
Let's say I have a sentence of text:
$body = 'the quick brown fox jumps over the lazy dog';
and I want to get that sentence into a hash of 'keywords', but I want to allow multi-word keywords; I have the following to get single word keywords:
$words{$_}++ for $body =~ m/(\w+)/g;
After this is complete, I have a hash that looks like the following:
'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1
The next step, so that I can get 2-word keywords, is the following:
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
But that only gets every "other" pair; looking like this:
'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1
I also need the one word offset:
'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1
Is there an easier way to do this than the following?
my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
Comments (5)
While the described task might be interesting to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.
You can do something a little funky with lookaheads:
If I use an expression that looks ahead for two words (and captures them) but consumes only one, I get every overlapping pair.
It seems I can generalize this by putting in a variable for the count.
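The lookahead idea described in this answer can be sketched roughly as follows. This is a minimal illustration rather than the answerer's original code: the helper name `ngram_counts_lookahead` and its `$max_n` parameter are assumptions made for the example, and it assumes words are separated by exactly one space, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): count every word
# n-gram of length 1..$max_n in $text.
sub ngram_counts_lookahead {
    my ($text, $max_n) = @_;
    my %count;
    for my $n (1 .. $max_n) {
        my $extra = $n - 1;    # number of words that follow the first one
        # The lookahead captures $n space-separated words without consuming
        # them; the trailing \w+ then consumes only the first word, so the
        # next match starts at the following word.
        $count{$_}++ for $text =~ /(?=\b(\w+(?: \w+){$extra}))\w+/g;
    }
    return \%count;
}

my $body  = 'the quick brown fox jumps over the lazy dog';
my $words = ngram_counts_lookahead($body, 3);
printf "%-20s %d\n", $_, $words->{$_} for sort keys %$words;
```

Because the capture lives inside a zero-width assertion, overlapping phrases such as 'the quick' and 'quick brown' are both returned by a single /g match.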
I would use look-ahead to collect everything but the first word. That way, the position advances correctly automatically.
You could simplify it a bit if you want to stick with a single space instead of \s+ (don't forget to remove the /x modifier if you do that), since you could collect any number of words in $2, instead of using one group per word.
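One way this could look, as a sketch under assumptions rather than the answerer's original code: the first word is matched normally as $1, and a look-ahead collects the remaining words into $2, so only the first word is consumed. The helper name `ngram_counts_first_plus_rest` is hypothetical, and the version here already uses the single-group simplification mentioned above while still allowing \s+ between words.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): count word n-grams
# of length 1..$max_n, matching the first word and looking ahead for the rest.
sub ngram_counts_first_plus_rest {
    my ($text, $max_n) = @_;
    my %count;
    for my $n (1 .. $max_n) {
        my $rest = $n - 1;
        # Group 1 is the first word and is the only part consumed.
        # Group 2 sits inside a look-ahead and holds the following words.
        my $re = qr/\b(\w+)(?=((?:\s+\w+){$rest}))/;
        while ($text =~ /$re/g) {
            # Normalise whatever whitespace separated the words to one space.
            my $phrase = join ' ', split ' ', "$1$2";
            $count{$phrase}++;
        }
    }
    return \%count;
}

my $body  = 'the quick brown fox jumps over the lazy dog';
my $words = ngram_counts_first_plus_rest($body, 3);
printf "%-20s %d\n", $_, $words->{$_} for sort keys %$words;
```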
Is there any particular reason for doing this using regexes alone? The obvious approach to me would be to split the text into an array, then use a pair of nested loops to extract your counts from it.
Edit: Fixed the $phrase_len loop to prevent use of negative indexes, which was causing incorrect results, per cjm's comment.
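The split-and-loop approach might be sketched like this. It is not the answerer's original program: the helper name `ngram_counts_split` and the `$max_len` parameter are assumptions, and the loop counts forward from each start position so that no negative index can arise. Note that split ' ' breaks on whitespace only, so punctuation stays attached to words, unlike the \w+ matching used in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): count phrases of
# 1..$max_len consecutive words using split and two nested loops.
sub ngram_counts_split {
    my ($text, $max_len) = @_;
    my @words = split ' ', $text;
    my %count;
    for my $start (0 .. $#words) {
        # Cap the phrase length so the slice never runs past the last word.
        my $longest = @words - $start;
        $longest = $max_len if $longest > $max_len;
        for my $phrase_len (1 .. $longest) {
            my $end    = $start + $phrase_len - 1;
            my $phrase = join ' ', @words[$start .. $end];
            $count{$phrase}++;
        }
    }
    return \%count;
}

my $body  = 'the quick brown fox jumps over the lazy dog';
my $words = ngram_counts_split($body, 3);
printf "%-20s %d\n", $_, $words->{$_} for sort keys %$words;
```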
Use the pos operator and the @- special array. For example, grab each pair's second word in its own capture and rewind the match's position so that what was the second word will be the next pair's first word.
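A minimal sketch of the pos and @- technique for two-word keywords, assuming a hypothetical helper named `bigram_counts_pos` (the answerer's original program is not reproduced here). `$-[2]` is the offset at which the second capture group started, so assigning it to pos() makes the next /g match begin at that word.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): count overlapping
# word pairs by rewinding the match position after each match.
sub bigram_counts_pos {
    my ($text) = @_;
    my %count;
    while ($text =~ /(\w+)\s+(\w+)/g) {
        $count{"$1 $2"}++;
        # Rewind to the start of the second word so it becomes the first
        # word of the next pair. The position still moves forward on every
        # iteration, so the loop terminates.
        pos($text) = $-[2];
    }
    return \%count;
}

my $body  = 'the quick brown fox jumps over the lazy dog';
my $pairs = bigram_counts_pos($body);
printf "%-20s %d\n", $_, $pairs->{$_} for sort keys %$pairs;
```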