Regex split into overlapping strings

Posted on 2024-08-25 04:02:51, 881 characters, 5 views, 0 comments

I'm exploring the power of regular expressions, so I'm just wondering if something like this is possible:

public class StringSplit {
    public static void main(String args[]) {
        System.out.println(
            java.util.Arrays.deepToString(
                "12345".split(INSERT_REGEX_HERE)
            )
        ); // prints "[12, 23, 34, 45]"
    }
}

If possible, then simply provide the regex (and preemptively some explanation on how it works).

If it's only possible in some regex flavors other than Java, then feel free to provide those as well.

If it's not possible, then please explain why.

BONUS QUESTION

Same question, but with a find() loop instead of split:

    Matcher m = Pattern.compile(BONUS_REGEX).matcher("12345");
    while (m.find()) {
        System.out.println(m.group());
    } // prints "12", "23", "34", "45"

Please note that it's not so much that I have a concrete task to accomplish one way or another, but rather I want to understand regular expressions. I don't need code that does what I want; I want regexes, if they exist, that I can use in the above code to accomplish the task (or regexes in other flavors that work with a "direct translation" of the code into another language).

And if they don't exist, I'd like a good solid explanation why.

姜生凉生 2024-09-01 04:02:51

I don't think this is possible with split(), but with find() it's pretty simple. Just use a lookahead with a capturing group inside:

Matcher m = Pattern.compile("(?=(\\d\\d)).").matcher("12345");
while (m.find())
{
  System.out.println(m.group(1));
}

Many people don't realize that text captured inside a lookahead or lookbehind can be referenced after the match just like any other capture. It's especially counter-intuitive in this case, where the capture is a superset of the "whole" match.

As a matter of fact, it works even if the regex as a whole matches nothing. Remove the dot from the regex above ("(?=(\\d\\d))") and you'll get the same result. This is because, whenever a successful match consumes no characters, the regex engine automatically bumps ahead one position before trying to match again, to prevent infinite loops.
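
To see both variants side by side, here is a minimal sketch (the class and method names are made up for illustration) that runs the pattern with and without the trailing dot and collects group 1 from each match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadPairs {
    // Hypothetical helper: collects capture group 1 of every match.
    // The regex is expected to do its capturing inside a lookahead.
    static List<String> capturedPairs(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Consumes one character per match, captures two via the lookahead.
        System.out.println(capturedPairs("(?=(\\d\\d)).", "12345")); // [12, 23, 34, 45]
        // Consumes nothing; find() still advances after each empty match.
        System.out.println(capturedPairs("(?=(\\d\\d))", "12345"));  // [12, 23, 34, 45]
    }
}
```

In both cases the overall match is shorter than the captured text, which is why m.group(1) is read instead of m.group().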

There's no split() equivalent for this technique, though, at least not in Java. Although you can split on lookarounds and other zero-width assertions, there's no way to get the same character to appear in more than one of the resulting substrings.
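
One way to convince yourself of that last point is the following sketch (names are illustrative only): when you split on a zero-width position, the pieces simply partition the input, so gluing them back together gives the original string and no character can show up twice.

```java
import java.util.Arrays;

class ZeroWidthSplit {
    // Splits at every zero-width position that sits between two digits.
    static String[] splitBetweenDigits(String input) {
        return input.split("(?<=\\d)(?=\\d)");
    }

    public static void main(String[] args) {
        String[] parts = splitBetweenDigits("12345");
        System.out.println(Arrays.toString(parts));                  // [1, 2, 3, 4, 5]
        // The pieces are a partition of the input, nothing is duplicated.
        System.out.println(String.join("", parts).equals("12345"));  // true
    }
}
```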

卷耳 2024-09-01 04:02:51

This somewhat heavy implementation using Matcher.find instead of split will also work, although by the time you have to code a for loop for such a trivial task you might as well drop the regular expressions altogether and use substrings (for similar coding complexity minus the CPU cycles):

import java.util.*;
import java.util.regex.*;

public class StringSplit { 
    public static void main(String args[]) { 
        ArrayList<String> result = new ArrayList<String>();
        for (Matcher m = Pattern.compile("..").matcher("12345"); m.find(result.isEmpty() ? 0 : m.start() + 1); result.add(m.group()));
        System.out.println( result.toString() ); // prints "[12, 23, 34, 45]" 
    } 
}
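
For comparison, the regex-free substring version mentioned above might look like this. It is only a sketch; the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

class SubstringPairs {
    // Slides a window of width 2 over the string, one character at a time.
    static List<String> pairs(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("12345")); // [12, 23, 34, 45]
    }
}
```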

EDIT1

find(): the reason nobody so far has been able to concoct a regular expression like your BONUS_REGEX lies within Matcher, which resumes looking for the next match where the previous match ended (i.e. no overlap), as opposed to just after where the previous match started, unless you explicitly respecify the start position of the search (as above). A good candidate for BONUS_REGEX would have been "(.\\G.|^..)" but, unfortunately, the \G-anchor-in-the-middle trick doesn't work with Java's Matcher (though it works just fine in Perl):

 perl -e 'while ("12345"=~/(^..|.\G.)/g) { print "$1\n" }'
 12
 23
 34
 45
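
The resume-where-the-last-match-ended behaviour on the Java side is easy to observe with a plain ".." pattern. A small sketch (helper names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonOverlappingFind {
    // Collects the whole text of every match found by a plain find() loop.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each search resumes at the end of the previous match,
        // so "23" and "45" are never seen.
        System.out.println(findAll("..", "12345")); // [12, 34]
    }
}
```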

split(): as for INSERT_REGEX_HERE, a good candidate would have been (?<=..)(?=..) (the split point is any zero-width position with two characters to its left and two to its right), but again, because split has no notion of overlap, you end up with [12, 3, 45] (which is close, but no cigar).
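
A quick sketch to reproduce that near miss (the class name is just for illustration):

```java
import java.util.Arrays;

class NearMissSplit {
    // Splits wherever two characters precede and two characters follow.
    static String[] split(String input) {
        return input.split("(?<=..)(?=..)");
    }

    public static void main(String[] args) {
        // Qualifying positions are after "12" and after "123",
        // so the middle piece is the lone "3".
        System.out.println(Arrays.toString(split("12345"))); // [12, 3, 45]
    }
}
```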

EDIT2

For fun, you can trick split() into doing what you want by first doubling non-boundary characters (here you need a reserved character value to split around):

Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1#$1").split("#")

We can be smart and eliminate the need for a reserved character by taking advantage of the fact that zero-width look-ahead assertions (unlike look-behind) can have an unbounded length; we can therefore split around all points which are an even number of characters away from the end of the doubled string (and at least two characters away from its beginning), producing the same result as above:

Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1").split("(?<=..)(?=(..)*$)")
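
Spelled out step by step, with the intermediate doubled string made visible, the trick looks roughly like this. This is a sketch with invented helper names, using String.replaceAll as shorthand for the Pattern/Matcher calls above:

```java
import java.util.Arrays;

class DoubledSplit {
    // Doubles every character that has a neighbour on both sides.
    static String doubleInner(String s) {
        return s.replaceAll("((?<=.).(?=.))", "$1$1");
    }

    // Splits the doubled string at positions that are at least two characters
    // from the start and an even number of characters from the end.
    static String[] overlappingPairs(String s) {
        return doubleInner(s).split("(?<=..)(?=(..)*$)");
    }

    public static void main(String[] args) {
        System.out.println(doubleInner("12345"));                       // 12233445
        System.out.println(Arrays.toString(overlappingPairs("12345"))); // [12, 23, 34, 45]
    }
}
```

The split also matches at the very end of the doubled string, but split() drops the trailing empty string, so it does not show up in the result.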

Alternatively, you can trick find() in a similar way (again without needing a reserved character value):

Matcher m = Pattern.compile("..").matcher(
  Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1")
);
while (m.find()) { 
    System.out.println(m.group()); 
} // prints "12", "23", "34", "45"

苦行僧 2024-09-01 04:02:51

Split chops a string into pieces, but it does not allow those pieces to overlap. You'll need a loop to get overlapping pieces.

内心激荡 2024-09-01 04:02:51

I don't think you can do this with split() because it throws away the part that matches the regular expression.

In Perl this works:

my $string = '12345';
my @array = ();
while ( $string =~ s/(\d(\d))/$2/ ) {
    push(@array, $1);
}
print join(" ", @array);
# prints: 12 23 34 45

The find-and-replace expression says: match the first two adjacent digits and replace them in the string with just the second of the two digits.
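
For readers following along in Java, a rough analogue of that destructive loop could look like the sketch below. It is not a direct translation of Perl's s///, and the names are made up; it just repeats "record the first pair, then replace it with its second digit" until no pair is left:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShrinkingReplace {
    private static final Pattern PAIR = Pattern.compile("(\\d(\\d))");

    static List<String> pairs(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(s);
        while (m.find()) {
            out.add(m.group(1));       // the two adjacent digits
            s = m.replaceFirst("$2");  // keep only the second digit
            m = PAIR.matcher(s);       // start over on the shortened string
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("12345")); // [12, 23, 34, 45]
    }
}
```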

め可乐爱微笑 2024-09-01 04:02:51

Alternatively, use plain matching in Perl. This should work anywhere lookaheads do, and there's no need for a loop here.

 $_ = '12345';
 @list = /(?=(..))./g;
 print "@list";

 # Output:
 # 12 23 34 45

But this one, as posted earlier, is nicer if the \G trick works:

 $_ = '12345';
 @list = /^..|.\G./g;
 print "@list";

 # Output:
 # 12 23 34 45

Edit: Sorry, didn't see that all of this was posted already.

甲如呢乙后呢 2024-09-01 04:02:51

Creating overlapping matches with String#split isn't possible, as the other answers have already stated. It is however possible to add a regex-replace before it to prepare the String, and then use the split to create regular pairs:

"12345".replaceAll(".(?=(.).)","$0$1")
       .split("(?<=\\G..)")

The .replaceAll(".(?=(.).)","$0$1") will transform "12345" into "12233445". It basically replaces every 123 substring with 1223, then every 234 with 2334 (note that the matches overlap), etc. In other words, it duplicates every character except the first and last.

.(?=(.).)  # Replace-regex:
.          #  A single character
 (?=    )  #  followed by (using a positive lookahead):
     . .   #   two more characters
    ( )    #   of which the first is saved in capture group 1

$0$1       # Replacement string:
$0         #  The entire match, which is the character itself since everything
           #  else was inside a lookahead
  $1       #  followed by capture group 1

After that, .split("(?<=\\G..)") will split this new String into pairs:

(?<=\G..) # Split-regex:
(?<=    ) #  A positive lookbehind:
    \G    #   Matching the end of the previous match
          #   (or the start of the string initially)
      ..  #   followed by two characters
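
Putting the two steps together with the intermediate value visible, as a sketch with illustrative names, checked against the behaviour of a standard JDK:

```java
import java.util.Arrays;

class DuplicateThenSplit {
    // Step 1: append the following character to every character that has
    // at least two characters after it ("12345" becomes "12233445").
    static String duplicateInner(String s) {
        return s.replaceAll(".(?=(.).)", "$0$1");
    }

    // Step 2: cut after every two characters, measured from the previous cut.
    static String[] pairs(String s) {
        return duplicateInner(s).split("(?<=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(duplicateInner("12345"));         // 12233445
        System.out.println(Arrays.toString(pairs("12345"))); // [12, 23, 34, 45]
    }
}
```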

Some more information about .split("(?<=\\G..)") can be found here.
