Perl - Regex - Position of first nonmatching character

Posted 2024-12-08 21:57:58

I want to find the position in a string where a regular expression stops matching.

Simple example:

my $x = 'abcdefghijklmnopqrstuvwxyz';
$x =~ /gho/;

This example should give me the position of the character 'h', because 'h' matches and 'o' is the first nonmatching character.

I thought of using pos or @-, but they are not set on an unsuccessful match.
Another solution would be to iteratively shorten the regex pattern until it matches, but that's very ugly and doesn't work on complex patterns.

EDIT:

Okay for the linguists: I'm sorry for my awful explanation.

To clarify my situation: if you think of a regular expression as a finite automaton, there is a point where matching stops because a character doesn't fit. That point is what I'm looking for.

Using nested optional parentheses (as mentioned by eugene y) is a nice idea, but it doesn't work with quantifiers, and I would have to edit the pattern.

Are there other ideas?


Comments (5)

病女 2024-12-15 21:57:58

What you are proposing is difficult but doable.

If I can paraphrase what I understand, you want to find out how far into the string a failing match got before it failed. In order to do this, you need to be able to parse a regex.

The best regex parser is probably Perl itself, run with the -Mre=debug command line switch:

$ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{5}/'
Compiling REx "gh[ijkl]{5}"
Final program:
   1: EXACT <gh> (3)
   3: CURLY {5,5} (16)
   5:   ANYOF[i-l][] (0)
  16: END (0)
anchored "gh" at 0 (checking anchored) minlen 7 
Guessing start of match in sv for REx "gh[ijkl]{5}" against "abcdefghijklmnopqr"
Found anchored substr "gh" at offset 6...
Starting position does not contradict /^/m...
Guessed: match at offset 6
Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                  ANYOF[i-l][] can match 4 times out of 5...
                                  failed...
Match failed
Freeing REx: "gh[ijkl]{5}"

You can shell out to that Perl command line with your regex and parse the output it returns (the debug trace is written to STDERR rather than STDOUT). Look for the `

Here is a matching regex:

$ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{3}/'
Compiling REx "gh[ijkl]{3}"
Final program:
   1: EXACT <gh> (3)
   3: CURLY {3,3} (16)
   5:   ANYOF[i-l][] (0)
  16: END (0)
anchored "gh" at 0 (checking anchored) minlen 5 
Guessing start of match in sv for REx "gh[ijkl]{3}" against "abcdefghijklmnopqr"
Found anchored substr "gh" at offset 6...
Starting position does not contradict /^/m...
Guessed: match at offset 6
Matching REx "gh[ijkl]{3}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {3,3}(16)
                                  ANYOF[i-l][] can match 3 times out of 3...
  11 <ghijk> <lmnopqr>       | 16:  END(0)
Match successful!
Freeing REx: "gh[ijkl]{3}"

You will need to build a parser that can handle the output of the Perl re debugger. The text in the left-hand and right-hand angle brackets shows how far into the string the regex engine is as it tries to match, and the number in front of them is the current offset.

This is not an easy project, by the way.
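
As a rough illustration of the parsing step, here is a minimal sketch that reads the leading offset from each trace line and keeps the largest one. The helper name `furthest_offset` is made up for this example, and the line format is assumed to be the one shown in the traces above; other Perl versions may print extra columns, and a `>` inside the quoted context would confuse this simple pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return the largest string offset that appears at the
# start of a trace line, or undef if the text contains no trace line.
sub furthest_offset {
    my ($trace) = @_;
    my $max;
    for my $line (split /\n/, $trace) {
        # Trace lines look like: "   8 <defgh> <ijklmnopqr>    |  3:CURLY ..."
        next unless $line =~ /^\s*(\d+)\s+<[^>]*>\s+<[^>]*>\s*\|/;
        $max = $1 if !defined $max || $1 > $max;
    }
    return $max;
}

# Sample lines copied from the failing trace above.
my $trace = <<'TRACE';
Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                  ANYOF[i-l][] can match 4 times out of 5...
                                  failed...
Match failed
TRACE

my $offset = furthest_offset($trace);
print "last traced offset: $offset\n";
```

Note that this only reports the last offset the trace prints (8 here). The quantifier line says four more characters were consumed before the failure, so a fuller parser would also have to interpret lines like "can match 4 times out of 5".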

最丧也最甜 2024-12-15 21:57:58

You can get the matching part, and use the index function to find its position:

my $x = 'abcdefghijklmnopqrstuvwxyz';

$x =~ /(g(h(o)?)?)/;
print index($x, $1) + length($1), "\n"; #8
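
A possible variation, sketched under the same hand-edited pattern: the end offset of group 1 is already available in `@+`, so no second search with index is needed. The helper name `partial_end` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: end offset of the longest prefix of "gho" matched at
# the leftmost "g" in $str, read from @+. Returns undef if there is no "g".
sub partial_end {
    my ($str) = @_;
    return unless $str =~ /(g(h(o)?)?)/;
    return $+[1];    # offset just past the end of capture group 1
}

print partial_end('abcdefghijklmnopqrstuvwxyz'), "\n";    # 8
```

Because the engine takes the leftmost match, a string such as 'gxxgh' reports 1 (the lone first "g") and never looks at the later "gh".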

浮生未歇 2024-12-15 21:57:58

This seems to work. Basically, the idea is to split the regex into its constituent parts and try them sequentially, returning the last matching position. The fixed strings need to be split up, but the character classes and quantifiers can be kept together.

In theory this should work, but it may need tweaking.

use v5.10;
use strict;
use warnings;

my $string = 'abcdefghijklmnopqrstuvwxyz';
my $match  = partial_match($string, qw(g h (?=i) [ijkx]+ [lmn]+ z));
say "match ended at pos $match, character ", substr($string,$match,1);

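sub partial_match {
    my ($string, @rx) = @_;
    # Try the first piece; /g in scalar context sets pos() on success.
    if ($string =~ /$rx[0]/g) {
        my $pos = pos $string;
        return $pos unless defined $rx[1];
        # Merge the first two pieces and retry with the longer pattern.
        splice @rx, 0, 2, $rx[0] . $rx[1];
        # Keep the deeper position if the longer pattern matched, else this one.
        return partial_match($string, @rx) // $pos;
    }
    say "Didn't match $rx[0]";
    return;
}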

嘿哥们儿 2024-12-15 21:57:58

How about:

#!/usr/bin/perl 
use Modern::Perl;

my $x = 'abcdefghijklmnopqrstuvwxyz';
my $s = 'gho';
do {
    if ($x =~ /$s/) {
        say "$s matches from $-[0] to $+[0]";
    } else {
        say "$s doesn't match";
    }
} while chop $s;

Output:

gho doesn't match
gh matches from 6 to 8
g matches from 6 to 7
 matches from 0 to 0
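
If only the longest matching prefix is of interest, the loop can stop at the first success. The following is a minimal sketch for literal strings only, using core Perl; the helper name `longest_prefix_match` is made up for this example, and `\Q...\E` is used so that the prefix is matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: for a literal string $literal, find the longest prefix
# that occurs in $str and return (prefix, start, end); empty list if none.
sub longest_prefix_match {
    my ($str, $literal) = @_;
    for (my $len = length $literal; $len > 0; $len--) {
        my $prefix = substr $literal, 0, $len;
        if ($str =~ /\Q$prefix\E/) {
            return ($prefix, $-[0], $+[0]);
        }
    }
    return;
}

my ($prefix, $start, $end) =
    longest_prefix_match('abcdefghijklmnopqrstuvwxyz', 'gho');
print "$prefix matches from $start to $end\n";    # gh matches from 6 to 8
```

As the question already points out, shortening character by character only makes sense for literal strings; chopping a pattern that contains quantifiers or groups can leave an invalid or different regex.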

波浪屿的海角声 2024-12-15 21:57:58

I think that's exactly what the pos function is for. Note: pos is only set when the match uses the /g flag.

my $x = 'abcdefghijklmnopqrstuvwxyz';
my $end = 0;
if( $x =~ /$ARGV[0]/g )
{
    $end = pos($x);
}
print "End of match is: $end\n";

This gives the following output:

[@centos5 ~]$ perl x.pl
End of match is: 0
[@centos5 ~]$ perl x.pl def
End of match is: 6
[@centos5 ~]$ perl x.pl xyz
End of match is: 26
[@centos5 ~]$ perl x.pl aaa
End of match is: 0
[@centos5 ~]$ perl x.pl ghi
End of match is: 9
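
For the failing case the question asks about, it may help to see what pos does after an unsuccessful match. The sketch below uses made-up patterns: a failed /g match resets pos to undef, while adding /c keeps the position left by the previous successful match.

```perl
use strict;
use warnings;

my $x = 'abcdefghijklmnopqrstuvwxyz';

my $ok = $x =~ /gh/g;            # succeeds, pos($x) is now 8
my $after_success = pos($x);

$ok = $x =~ /zzz/g;              # fails, which resets pos($x) to undef
my $after_failure = pos($x);

pos($x) = 0;
$ok = $x =~ /gh/g;               # succeeds again, pos($x) is 8
$ok = $x =~ /zzz/gc;             # fails, but /c keeps the old position
my $after_failure_c = pos($x);

printf "after success: %s\n", $after_success;
printf "after failure: %s\n", defined $after_failure ? $after_failure : 'undef';
printf "after failure with /c: %s\n", $after_failure_c;
```

Either way, pos reflects the end of the last successful match, not the point inside a failing pattern where the engine gave up.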