Is there a way to match all adjacent words in a sentence?
use strict;
use warnings;

my %wordcount;
my $line = "The quick brown fox jumps over the lazy dog.";
$line =~ s/["",]//g;   # remove double quotes and commas
$line = lc($line);     # lc converts to lowercase
while ($line =~ m/\b(\w+\s\w+)\b/g) {   # capture two words separated by one whitespace character
    my $word = $1;
    print "$word\n";
    $wordcount{$word} += 1;
}
The desired output is: the quick, quick brown, brown fox, fox jumps, .... However, with the code above I only get: the quick, brown fox, jumps over, ....
You can capture both words but not consume the second, by using a lookahead, so that the pairs overlap.
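A minimal sketch of this approach, assuming the pattern `(\w+)\s+(?=(\w+))` (reconstructed from the explanation that follows, so treat it as an assumption) and the sample sentence from the question. The name `@pairs` is illustrative only:

```perl
use strict;
use warnings;

my $line = lc "The quick brown fox jumps over the lazy dog.";

my @pairs;
# Consume one word plus the whitespace after it; the lookahead captures
# the next word into $2 without consuming it, so successive matches overlap.
while ($line =~ /(\w+)\s+(?=(\w+))/g) {
    push @pairs, "$1 $2";
}

print "$_\n" for @pairs;
```

Collecting the pairs in an array rather than printing inside the loop is just a convenience here; it makes it easy to count or post-process them afterwards.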
This approach yields the overlapping pairs as desired, and it allows any amount of whitespace between words.†

An explanation.

After a word is captured with `(\w+)`, the lookahead `(?=...)` merely asserts ("looks ahead") that another word follows; it neither "consumes" that word nor advances past it. Because the lookahead contains an extra pair of capturing parentheses, we still get the two words in `$1` and `$2`. Only one word has been consumed, so the regex engine stays right after the space(s) following the first word.

In the next iteration it can therefore match the next word, the one last "seen" by the lookahead. The lookahead then spots the word after that, again capturing both. And so on, hence the overlap.
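One way to observe this is to record `pos($line)`, the offset at which the next search will start, after each match. A small sketch, assuming the pattern `(\w+)\s+(?=(\w+))` and a shortened sample string; `@trace` is an illustrative name:

```perl
use strict;
use warnings;

my $line = "the quick brown fox";

my @trace;
while ($line =~ /(\w+)\s+(?=(\w+))/g) {
    # pos() is the offset just past the consumed text: the first word and
    # the whitespace after it, but not the word seen by the lookahead.
    push @trace, sprintf("%s %s (next search starts at offset %d)", $1, $2, pos($line));
}

print "$_\n" for @trace;
```

Each reported offset is the start of the second word of the pair, which is why that word can become the first word of the next match.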
† Drop the `+` after `\s` (that is, use `\s` rather than `\s+`) if you indeed want to allow exactly one whitespace character. If you want a literal space only (no tabs etc.; see the Perl documentation for exactly what `\s` matches), then instead of `\s` use a plain literal space, or `[ ]`, a literal space inside a "character class" (brackets), for clarity.
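To illustrate the footnote, the following sketch runs the three separator variants over a string with irregular spacing. The helper `pairs_with` and the sample string are made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: collect overlapping pairs using the given separator regex.
sub pairs_with {
    my ($sep, $text) = @_;
    my @pairs;
    while ($text =~ /(\w+)$sep(?=(\w+))/g) {
        push @pairs, "$1 $2";
    }
    return @pairs;
}

my $text = "one  two\tthree four";   # a double space, then a tab, then a single space

print join(", ", pairs_with(qr/\s+/, $text)), "\n";  # any run of whitespace
print join(", ", pairs_with(qr/\s/,  $text)), "\n";  # exactly one whitespace character
print join(", ", pairs_with(qr/[ ]/, $text)), "\n";  # exactly one literal space
```

With this sample string only `\s+` finds all three pairs; `\s` misses the pair separated by two spaces, and `[ ]` additionally misses the pair separated by a tab.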
You can use the regex `(\w+)\s(?=(\w+\b))`.

Regex Explanation

- `(` Capturing group
- `\w+` Match a word
- `)` Close group
- `\s` Match a whitespace character
- `(?=` Lookahead assertion: assert that the following regex matches
- `(` Capturing group
- `\w+\b` Match a word
- `)` Close group
- `)` Close lookahead
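A sketch of what a Perl example using this pattern could look like, assuming the pattern `(\w+)\s(?=(\w+\b))` assembled from the pieces listed above and the question's sample sentence. The `%wordcount` hash is borrowed from the question's code; `@pairs` is an illustrative name:

```perl
use strict;
use warnings;

my $line = lc "The quick brown fox jumps over the lazy dog.";

my (@pairs, %wordcount);
while ($line =~ /(\w+)\s(?=(\w+\b))/g) {
    my $pair = "$1 $2";      # $2 comes from the group inside the lookahead
    push @pairs, $pair;
    $wordcount{$pair}++;     # tally each pair, as the question's code does
}

print "$_\n" for @pairs;
```

For the sample sentence this gives eight pairs, from "the quick" through "lazy dog", each counted once.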
You don't need to do anything fancy with regular expressions at all if you split the string up into an array of words.
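A minimal sketch of the split-based idea, not necessarily the answer's original code. It assumes punctuation should be stripped first so that the final "." does not stick to "dog"; the names `@words` and `@pairs` are illustrative:

```perl
use strict;
use warnings;

my $line = lc "The quick brown fox jumps over the lazy dog.";
$line =~ s/[^\w\s]//g;            # assumption: drop punctuation such as the final "."

my @words = split ' ', $line;     # split on runs of whitespace
# Pair each word with its successor; the range is empty for fewer than two words.
my @pairs = map { "$words[$_] $words[$_ + 1]" } 0 .. $#words - 1;

print "$_\n" for @pairs;
```

Indexing up to `$#words - 1` pairs every word except the last with the word after it, which gives the same overlapping pairs as the lookahead approach.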