I'm exploring the power of regular expressions, so I'm just wondering if something like this is possible:
public class StringSplit {
    public static void main(String args[]) {
        System.out.println(
            java.util.Arrays.deepToString(
                "12345".split(INSERT_REGEX_HERE)
            )
        ); // prints "[12, 23, 34, 45]"
    }
}
If possible, then simply provide the regex (and preemptively some explanation on how it works).
If it's only possible in some regex flavors other than Java, then feel free to provide those as well.
If it's not possible, then please explain why.
BONUS QUESTION
Same question, but with a find() loop instead of split:
Matcher m = Pattern.compile(BONUS_REGEX).matcher("12345");
while (m.find()) {
    System.out.println(m.group());
} // prints "12", "23", "34", "45"
Please note that it's not so much that I have a concrete task to accomplish one way or another, but rather I want to understand regular expressions. I don't need code that does what I want; I want regexes, if they exist, that I can use in the above code to accomplish the task (or regexes in other flavors that work with a "direct translation" of the code into another language).
And if they don't exist, I'd like a good solid explanation why.
I don't think this is possible with split(), but with find() it's pretty simple. Just use a lookahead with a capturing group inside, such as "(?=(\\d\\d)).", and read each pair from group 1 rather than from the whole match. Many people don't realize that text captured inside a lookahead or lookbehind can be referenced after the match just like any other capture. It's especially counter-intuitive in this case, where the capture is a superset of the "whole" match.
As a matter of fact, it works even if the regex as a whole matches nothing. Remove the dot from the regex above ("(?=(\\d\\d))") and you'll get the same result. This is because, whenever a successful match consumes no characters, the regex engine automatically bumps ahead one position before trying to match again, to prevent infinite loops.
There's no split() equivalent for this technique, though, at least not in Java. Although you can split on lookarounds and other zero-width assertions, there's no way to get the same character to appear in more than one of the resulting substrings.
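The code that originally accompanied this answer is missing from the page, so here is a minimal sketch of the lookahead-with-capture idea. The class and method names (LookaheadPairs, pairs) are made up for illustration; only the two regexes come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadPairs {
    // Collects group 1 of every match; the regex is expected to capture
    // the pair inside a lookahead, so matches can overlap.
    static List<String> pairs(String input, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1)); // group 1, not group(): the whole match is at most one char
        }
        return out;
    }

    public static void main(String[] args) {
        // Lookahead captures two digits, the dot then consumes one character.
        System.out.println(pairs("12345", "(?=(\\d\\d)).")); // prints [12, 23, 34, 45]
        // Zero-width variant: the engine advances by itself after an empty match.
        System.out.println(pairs("12345", "(?=(\\d\\d))"));  // prints [12, 23, 34, 45]
    }
}
```

Note that the loop reads m.group(1); with m.group() as in the question, the first regex would yield single characters and the second only empty strings.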
This somewhat heavy implementation using Matcher.find instead of split will also work: run find(int) in a for loop and explicitly restart each search one character after the start of the previous match. That said, by the time you have to code a for loop for such a trivial task you might as well drop the regular expressions altogether and use substrings (for similar coding complexity minus the CPU cycles).
EDIT1
find(): the reason why nobody so far has been able to concoct a regular expression like your BONUS_REGEX lies within Matcher, which resumes looking for the next match where the previous match ended (i.e. no overlap), as opposed to just after where the previous match started, unless you explicitly respecify the start position of the search (as above). A good candidate for BONUS_REGEX would have been "(.\\G.|^..)" but, unfortunately, the \G-anchor-in-the-middle trick doesn't work with Java's Matcher (though it works just fine in Perl).
split(): as for INSERT_REGEX_HERE, a good candidate would have been (?<=..)(?=..) (the split point is the zero-width position with two characters to its left and two to its right), but again, because split has no notion of overlap, you end up with [12, 3, 45] (which is close, but no cigar).
EDIT2
For fun, you can trick split() into doing what you want by first doubling the non-boundary characters (here you need a reserved character value to split around).
We can be smart and eliminate the need for a reserved character by taking advantage of the fact that zero-width look-ahead assertions (unlike look-behind) can have an unbounded length; we can therefore split around all points which are an even number of characters away from the end of the doubled string (and at least two characters away from its beginning), producing the same result as above.
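The snippets for these two split() tricks did not survive on this page. The following is one possible way to write them, not necessarily the answer's exact code: the doubling regex, the "#" separator, and the names DoubledSplit, doubleInner, splitWithReservedChar and splitWithoutReservedChar are all assumptions chosen for illustration.

```java
import java.util.Arrays;

public class DoubledSplit {
    // Duplicates every character that has a neighbour on both sides and puts
    // `separator` between the two copies. The separator is assumed to contain
    // no '$' or '\', which are special in replacement strings.
    static String doubleInner(String input, String separator) {
        return input.replaceAll("(?<=.)(.)(?=.)", "$1" + separator + "$1");
    }

    // "12345" -> "12#23#34#45" -> split on the reserved character '#'.
    static String[] splitWithReservedChar(String input) {
        return doubleInner(input, "#").split("#");
    }

    // "12345" -> "12233445" -> split wherever at least two characters lie
    // behind and an even number of characters remains up to the end.
    static String[] splitWithoutReservedChar(String input) {
        return doubleInner(input, "").split("(?<=..)(?=(..)*$)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWithReservedChar("12345")));    // prints [12, 23, 34, 45]
        System.out.println(Arrays.toString(splitWithoutReservedChar("12345"))); // prints [12, 23, 34, 45]
    }
}
```

The first variant breaks if the input itself contains '#', which is exactly the limitation the second variant removes.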
Alternatively, you can trick find() in a similar way, by matching plain pairs in the doubled string (but without the need for a reserved character value).
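Both find()-based approaches from this answer can be sketched as follows. The original code is not preserved here, so treat this as an illustration under assumptions: the names FindPairs, findWithExplicitStart and findInDoubledString are hypothetical, and the doubling regex is one plausible choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindPairs {
    // Restarts every search one character after the start of the previous
    // match, so consecutive matches of ".." overlap by one character.
    static List<String> findWithExplicitStart(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("..").matcher(input);
        int start = 0;
        while (start < input.length() && m.find(start)) {
            out.add(m.group());
            start = m.start() + 1;
        }
        return out;
    }

    // Doubles every inner character first ("12345" -> "12233445"), after which
    // ordinary non-overlapping matches of ".." are already the wanted pairs.
    static List<String> findInDoubledString(String input) {
        String doubled = input.replaceAll("(?<=.)(.)(?=.)", "$1$1");
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("..").matcher(doubled);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findWithExplicitStart("12345")); // prints [12, 23, 34, 45]
        System.out.println(findInDoubledString("12345"));   // prints [12, 23, 34, 45]
    }
}
```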
Split chops a string into multiple pieces, but that doesn't allow for overlap. You'd need to use a loop to get overlapping pieces.
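For comparison, such a loop needs no regex at all. A minimal sketch, with SubstringPairs and pairs as made-up names:

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringPairs {
    // Slides a two-character window over the input, one position at a time.
    static List<String> pairs(String input) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 2 <= input.length(); i++) {
            out.add(input.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("12345")); // prints [12, 23, 34, 45]
    }
}
```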
I don't think you can do this with split() because it throws away the part that matches the regular expression.
In Perl this works:
The find-and-replace expression says: match the first two adjacent digits and replace them in the string with just the second of the two digits.
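The Perl snippet itself is not preserved on this page. As a rough Java rendering of the loop described above (an assumption about its shape, not a translation of the actual Perl; ShrinkingPairs and pairs are hypothetical names), one could write:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShrinkingPairs {
    // Repeatedly records the first two adjacent digits, then replaces them
    // with just the second digit, so the next pair starts one digit later.
    static List<String> pairs(String input) {
        Pattern twoDigits = Pattern.compile("\\d\\d");
        List<String> out = new ArrayList<>();
        String rest = input;
        Matcher m = twoDigits.matcher(rest);
        while (m.find()) {
            out.add(m.group());
            rest = rest.replaceFirst("\\d(\\d)", "$1"); // "12345" -> "2345"
            m = twoDigits.matcher(rest);                // start over on the shrunk string
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("12345")); // prints [12, 23, 34, 45]
    }
}
```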
Alternative, using plain matching with Perl. Should work anywhere where lookaheads do. And no need for loops here.
But this one, as posted earlier, is nicer if the \G trick works:
Edit: Sorry, didn't see that all of this was posted already.
Creating overlapping matches with String#split isn't possible, as the other answers have already stated. It is however possible to add a regex-replace before it to prepare the String, and then use the split to create regular pairs.
The .replaceAll(".(?=(.).)","$0$1") will transform "12345" into "12233445". It basically replaces every 123 substring with 1223, then every 234 with 2334 (note that it's overlapping), etc. In other words, it'll duplicate every character, except for the first and last.
After that, .split("(?<=\\G..)") will split this new String into pairs.
Some more information about .split("(?<=\\G..)") can be found here.
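Putting the two calls from this answer together gives the sketch below. The wrapper names PrepareThenSplit, prepare and pairs are invented for the example, and the \G-inside-lookbehind split is assumed to run on the standard java.util.regex implementation; other regex engines may treat it differently.

```java
import java.util.Arrays;

public class PrepareThenSplit {
    // Step 1: duplicate every character except the first and the last,
    // e.g. "12345" -> "12233445".
    static String prepare(String input) {
        return input.replaceAll(".(?=(.).)", "$0$1");
    }

    // Step 2: split after every two characters counted from the end of the
    // previous split point (\G), which yields regular pairs.
    static String[] pairs(String input) {
        return prepare(input).split("(?<=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(prepare("12345"));                 // prints 12233445
        System.out.println(Arrays.toString(pairs("12345")));  // prints [12, 23, 34, 45]
    }
}
```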