正则表达式匹配最长的重复子字符串

发布于 2025-01-04 02:20:43 字数 1109 浏览 0 评论 0原文

我正在编写正则表达式来检查是否存在一个子字符串，该子字符串包含至少 2 个彼此相邻的模式重复。我将正则表达式的结果与前一个字符串进行匹配 - 如果相等，则存在这样的模式。更好地举例说明：1010 包含模式 10，并且它在连续系列中出现了 2 次。另一方面，10210 不会有这样的模式，因为这 10 个不相邻。

更重要的是，我需要找到可能的最长模式，并且它的长度至少为1。我已经编写了表达式来检查它 ^.*?(.+)(\1).*?$.为了找到最长的模式，我使用非贪婪版本来匹配模式之前的某些内容，然后模式与组 1 匹配，并再次匹配与组 1 匹配的相同内容。然后匹配字符串的其余部分，产生相等的字符串。但是存在一个问题，正则表达式在找到第一个模式后急于返回，并且没有真正考虑到我打算使之前和之后的这些子字符串尽可能短（使其余部分尽可能最长）。因此，从字符串 01011010 中，我正确地得知存在匹配，但存储在组 1 中的模式只是 01，尽管我排除了 101。

因为我相信我不能让模式“更贪婪”，或者在“更非贪婪”之前和之后垃圾，所以我只能想出一个让正则表达式不那么急切的想法，但我不确定这是否可能。

更多示例：

56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match

使用当前表达式我会得到什么（使用相同的字符串）：

no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match

原文

I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.

What's more, I need to find the longest pattern possible, and it's length is at least 1. I have written the expression to check for it ^.*?(.+)(\1).*?$. To find longest pattern, I've used non-greedy version to match something before patter, then pattern is matched to group 1 and once again same thing that has been matched for group1 is matched. Then the rest of string is matched, producing equal string. But there's a problem that regex is eager to return after finding first pattern, and don't really take into account that I intend to make those substrings before and after shortest possible (leaving the rest longest possible). So from string 01011010 I get correctly that there's match, but the pattern stored in group 1 is just 01 though I'd except 101.

As I believe I can't make pattern "more greedy" or trash before and after even "more non-greedy" I can only come whit an idea to make regex less eager, but I'm not sure if this is possible.

Further examples:

56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match

What I would get using current expression (same strings used):

no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梅倚清风 2025-01-11 02:20:43

在 Perl 中，您可以在 (??{ code })：

$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;

输出：

这里发生的情况是，在匹配的连续子字符串对之后，我们通过负向前视确保后面不再有对。

为了制作较长对的表达式，使用了延迟子表达式构造 (??{ code })，它（每次）都会计算内部代码并使用返回的字符串作为表达式。

它构造的子表达式的形式为 .+?(..{N,})\1，其中 N 是第一个捕获组的当前长度 (length($^N)< /code>, $^N 包含前一个捕获组的当前值）。

因此，完整的表达式将具有以下形式：

(?=(.+)\1)(?!.+?(..{N,})\2}))

使用神奇的 N （并且第二个捕获组不是原始表达式的“真实”/正确捕获组）。

使用示例：

use v5.10;

sub longest_rep{
    $_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}

say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';

输出：

In Perl you can do it with one expression with help of (??{ code }):

$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;

Output:

What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.

To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.

The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).

Thus the full expression would have the form:

(?=(.+)\1)(?!.+?(..{N,})\2}))

With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).

Usage example:

use v5.10;

sub longest_rep{
    $_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}

say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';

Output:

回复收藏 0 原文

自在安然 2025-01-11 02:20:43

您可以在单个正则表达式中完成此操作，只需从结果列表中手动选择最长的匹配项即可。

def longestrepeating(strg):
    regex = re.compile(r"(?=(.+)\1)")
    matches = regex.findall(strg)
    if matches:
        return max(matches, key=len)

这将为您提供（因为 re.findall() 返回匹配捕获组的列表，即使匹配本身的长度为零）：

>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']

You can do it in a single regex, you just have to pick the longest match from the list of results manually.

def longestrepeating(strg):
    regex = re.compile(r"(?=(.+)\1)")
    matches = regex.findall(strg)
    if matches:
        return max(matches, key=len)

This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):

>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']

回复收藏 0 原文

爱的故事 2025-01-11 02:20:43

这是一个很长的脚本，可以满足您的要求。它基本上会遍历您的输入字符串，将其缩短一，然后再次遍历它。一旦找到所有可能的匹配项，它就会返回最长的匹配项之一。可以对其进行调整，以便返回所有最长的匹配项，而不是仅返回一个，但我将把它留给您。

这是非常基本的代码，但希望您能理解它的要点。

use v5.10;
use strict;
use warnings;

while (<DATA>) {
    chomp;
    print "$_ : ";
    my $longest = foo($_);
    if ($longest) {
        say $longest;
    } else {
        say "No matches found";
    }
}

sub foo {
    my $num = shift;
    my @hits;
    for my $i (0 .. length($num)) {
        my $part = substr $num, $i;
        push @hits, $part =~ /(.+)(?=\1)/g;
    }
    my $long = shift @hits;
    for (@hits) {
        if (length($long) < length) {
            $long = $_;
        }
    }
    return $long;
}

__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919

Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.

It's pretty rudimentary code, but hopefully you'll get the gist of it.

use v5.10;
use strict;
use warnings;

while (<DATA>) {
    chomp;
    print "$_ : ";
    my $longest = foo($_);
    if ($longest) {
        say $longest;
    } else {
        say "No matches found";
    }
}

sub foo {
    my $num = shift;
    my @hits;
    for my $i (0 .. length($num)) {
        my $part = substr $num, $i;
        push @hits, $part =~ /(.+)(?=\1)/g;
    }
    my $long = shift @hits;
    for (@hits) {
        if (length($long) < length) {
            $long = $_;
        }
    }
    return $long;
}

__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919

回复收藏 0 原文

以可爱出名 2025-01-11 02:20:43

不确定是否有人想到过这一点...

my $originalstring="pdxabababqababqh1234112341";

my $max=int(length($originalstring)/2);
my @result;
foreach my $n (reverse(1..$max)) {
    @result=$originalstring=~m/(.{$n})\1/g;
    last if @result;
}

print join(",",@result),"\n";

最长的双倍匹配不能超过原始字符串长度的一半，所以我们从那里开始倒数。

如果怀疑匹配项相对于原始字符串的长度来说很小，那么这个想法可以颠倒过来……我们不是向下计数直到找到匹配项，而是向上计数直到没有更多匹配项。然后我们需要备份 1 并给出结果。我们还需要在正则表达式中的 $n 后面放置一个逗号。

my $n;
foreach (1..$max) {
    unless (@result=$originalstring=~m/(.{$_,})\1/g) {
        $n=--$_;
        last;
    }
}
@result=$originalstring=~m/(.{$n})\1/g;

print join(",",@result),"\n";

Not sure if anyone's thought of this...

my $originalstring="pdxabababqababqh1234112341";

my $max=int(length($originalstring)/2);
my @result;
foreach my $n (reverse(1..$max)) {
    @result=$originalstring=~m/(.{$n})\1/g;
    last if @result;
}

print join(",",@result),"\n";

The longest doubled match cannot exceed half the length of the original string, so we count down from there.

If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.

my $n;
foreach (1..$max) {
    unless (@result=$originalstring=~m/(.{$_,})\1/g) {
        $n=--$_;
        last;
    }
}
@result=$originalstring=~m/(.{$n})\1/g;

print join(",",@result),"\n";

回复收藏 0 原文

骄兵必败 2025-01-11 02:20:43

正则表达式有助于解决这个问题，但我认为您不能将其作为单个表达式来完成，因为您想要找到最长的成功匹配，而正则表达式只是查找它们的第一个匹配可以找到。贪婪可以用来调整首先找到哪个匹配（字符串中较早和较晚的匹配），但我想不出一种方法来选择较早、较长的子字符串而不是较晚的子字符串，较短的子字符串，同时也更喜欢较晚的，较长的子字符串而不是较早的，较短的子字符串。

使用正则表达式的一种方法是按降序迭代可能的长度，并在找到指定长度的匹配项后立即退出：

my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
  if($s =~ m/(.{$i})\1/)
  {
    $one = $1;
    last;
  }
}
# now $one is '101'

Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.

One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:

my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
  if($s =~ m/(.{$i})\1/)
  {
    $one = $1;
    last;
  }
}
# now $one is '101'

回复收藏 0 原文

~没有更多了~