Why doesn't finite repetition in lookbehind work in some flavors?

Posted 2024-09-08 01:17:49, 424 words, 15 views, 0 comments

I want to parse the 2 digits in the middle from a date in dd/mm/yy format but also allowing single digits for day and month.

This is what I came up with:

(?<=^[\d]{1,2}\/)[\d]{1,2}

I want a 1 or 2 digit number [\d]{1,2} with a 1 or 2 digit number and slash ^[\d]{1,2}\/ before it.

This doesn't work on many inputs; I have tested 10/10/10, 11/12/13, etc.

But to my surprise (?<=^\d\d\/)[\d]{1,2} worked.

But the [\d]{1,2} should also match if \d\d did, or am I wrong?

爱*していゐ 2024-09-15 01:17:49

On lookbehind support

Major regex flavors differ in their support for lookbehind; some impose certain restrictions, and some don't support it at all.

  • JavaScript: not supported before ES2018 (newer engines do support it)
  • Python: fixed length only
  • Java: finite length only
  • .NET: no restriction

Python

In Python, where only fixed length lookbehind is supported, your original pattern raises an error because \d{1,2} obviously does not have a fixed length. You can "fix" this by alternating on two different fixed-length lookbehinds, e.g. something like this:

(?<=^\d\/)\d{1,2}|(?<=^\d\d\/)\d{1,2}

Or perhaps you can put both lookbehinds as alternates of a non-capturing group:

(?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}

(note that you can just use \d without the brackets).

That said, it's probably much simpler to use a capturing group instead:

^\d{1,2}\/(\d{1,2})

Note that findall returns what group 1 captures if the pattern has exactly one group. Capturing groups are more widely supported than lookbehind, and they often lead to a more readable pattern (as in this case).

This snippet illustrates all of the above points:

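import re

# Two fixed-length lookbehinds as alternatives of a non-capturing group
p = re.compile(r'(?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}')

print(p.findall("12/34/56"))   # ['34']
print(p.findall("1/23/45"))    # ['23']

# Capturing group: findall returns what group 1 captured
p = re.compile(r'^\d{1,2}\/(\d{1,2})')

print(p.findall("12/34/56"))   # ['34']
print(p.findall("1/23/45"))    # ['23']

# A variable-length lookbehind is rejected when the pattern is compiled
try:
    p = re.compile(r'(?<=^\d{1,2}\/)\d{1,2}')
except re.error as e:
    print(e)   # raises re.error: look-behind requires fixed-width pattern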

Java

Java supports only finite-length lookbehind, so you can use \d{1,2} like in the original pattern. This is demonstrated by the following snippet:

    String text =
        "12/34/56 date\n" +
        "1/23/45 another date\n";

    Pattern p = Pattern.compile("(?m)(?<=^\\d{1,2}/)\\d{1,2}");
    Matcher m = p.matcher(text);
    while (m.find()) {
        System.out.println(m.group());
    } // "34", "23"

Note that (?m) is the embedded Pattern.MULTILINE so that ^ matches the start of every line. Note also that since \ is an escape character for string literals, you must write "\\" to get one backslash in Java.
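
For comparison, here is a minimal Python sketch of the same multi-line extraction. Since Python's re module rejects the variable-length lookbehind, it falls back on the capturing-group form; the names text, month_re, and months are only illustrative.

```python
import re

# Same two-line input as the Java snippet above.
text = "12/34/56 date\n1/23/45 another date\n"

# (?m) is the inline form of re.MULTILINE, so ^ matches at the start of
# every line. Group 1 captures the middle number.
month_re = re.compile(r'(?m)^\d{1,2}/(\d{1,2})')

# With exactly one group, findall returns the group 1 captures.
months = month_re.findall(text)
print(months)
```

This should print the two middle fields, one per input line, without relying on lookbehind at all.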


C-Sharp

C# (.NET) allows any regex inside a lookbehind. The following snippet shows how you can use + repetition in a lookbehind:

var text = @"
1/23/45
12/34/56
123/45/67
1234/56/78
";

Regex r = new Regex(@"(?m)(?<=^\d+/)\d{1,2}");
foreach (Match m in r.Matches(text)) {
  Console.WriteLine(m);
} // "23", "34", "45", "56"

Note that unlike Java, in C# you can use an @-quoted (verbatim) string so that you don't have to escape \.

For completeness, here's how you'd use the capturing group option in C#:

Regex r = new Regex(@"(?m)^\d+/(\d{1,2})");
foreach (Match m in r.Matches(text)) {
  Console.WriteLine("Matched [" + m + "]; month = " + m.Groups[1]);
}

Given the previous text, this prints:

Matched [1/23]; month = 23
Matched [12/34]; month = 34
Matched [123/45]; month = 45
Matched [1234/56]; month = 56

找回味觉 2024-09-15 01:17:49

Unless there's a specific reason for using the lookbehind which isn't noted in the question, how about simply matching the whole thing and only capturing the bit you're interested in instead?

JavaScript example:

>>> /^\d{1,2}\/(\d{1,2})\/\d{1,2}$/.exec("12/12/12")[1]
"12"
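
A rough Python equivalent of the same idea, as a sketch: match the whole date and capture only the middle field. The helper name middle_field is made up for illustration, and fullmatch is used here in place of the ^...$ anchors.

```python
import re

# Whole-string pattern; only the middle field is captured.
DATE_RE = re.compile(r'\d{1,2}/(\d{1,2})/\d{1,2}')

def middle_field(date):
    """Return the middle number of a d/m/y string, or None if it does not match."""
    m = DATE_RE.fullmatch(date)
    return m.group(1) if m else None

print(middle_field("12/12/12"))
print(middle_field("1/2/03"))
print(middle_field("123/4/56"))  # three-digit day, so no match
```

Because the whole string has to match, malformed input such as a three-digit day yields None instead of a partial match.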

不疑不惑不回忆 2024-09-15 01:17:49

To quote regular-expressions.info:

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

In other words, your regex does not work because you're using a variable-width expression inside a lookbehind, and your regex engine does not support that.
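
To see these rules in one engine, the sketch below asks Python's re module whether it accepts a few lookbehind patterns. The helper compiles and the candidate list are illustrative only, and other flavors will accept or reject different subsets.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

candidates = [
    r'(?<=^\d\d/)\d{1,2}',     # fixed width: always 3 characters
    r'(?<=ab|cd)x',            # alternation, both options 2 characters
    r'(?<=^\d{1,2}/)\d{1,2}',  # repetition: 2 or 3 characters
    r'(?<=^\d?\d/)\d{1,2}',    # optional item: 2 or 3 characters
    r'(?<=a|bc)x',             # alternation with unequal lengths
]

for pat in candidates:
    print(pat, "accepted" if compiles(pat) else "rejected")
```

Under these assumptions the first two patterns compile and the last three are rejected, which is why \d\d works inside the lookbehind while \d{1,2} does not.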

爱已欠费 2024-09-15 01:17:49

In addition to those listed by @polygenelubricants, there are two more exceptions to the "fixed length only" rule. In PCRE (the regex engine for PHP, Apache, et al.) and Oniguruma (Ruby 1.9, TextMate), a lookbehind may consist of an alternation in which each alternative may match a different number of characters, as long as the length of each alternative is fixed. For example:

(?<=\b\d\d/|\b\d/)\d{1,2}(?=/\d{2}\b)

Note that the alternation has to be at the top level of the lookbehind subexpression. You might, like me, be tempted to factor out the common elements, like this:

(?<=\b(?:\d\d|\d)/)\d{1,2}(?=/\d{2}\b)

...but it wouldn't work; at the top level, the subexpression now consists of a single alternative with a non-fixed length.

The second exception is much more useful: \K, supported by Perl and PCRE. It effectively means "pretend the match really started here." Whatever appears before it in the regex is treated as a positive lookbehind. As with .NET lookbehinds, there are no restrictions; whatever can appear in a normal regex can be used before the \K.

\b\d{1,2}/\K\d{1,2}(?=/\d{2}\b)
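
Python's standard re module does not offer \K, so the closest approximation there is a capturing group placed where \K would go. The sketch below assumes that substitution; the helper kept_matches and the sample text are hypothetical.

```python
import re

# The prefix before the group plays the role of everything before \K;
# group 1 is the part that \K would have reported as the match.
pattern = re.compile(r'\b\d{1,2}/(\d{1,2})(?=/\d{2}\b)')

def kept_matches(text):
    """Return (start, end, value) of group 1 for every match in text."""
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(text)]

sample = "due 1/23/45 and 12/34/56"
for start, end, value in kept_matches(sample):
    print(start, end, value)
```

One difference from \K: the overall match still includes the prefix, so the scan resumes after the whole match rather than after the kept part. For non-overlapping dates like these, that makes no practical difference.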

But most of the time, when someone has a problem with lookbehinds, it turns out they shouldn't even be using them. As @insin pointed out, this problem can be solved much more easily by using a capturing group.

EDIT: Almost forgot JGSoft, the regex flavor used by EditPad Pro and PowerGrep; like .NET, it has completely unrestricted lookbehinds, positive and negative.
