Regex to match Razor-like expressions

Posted 2024-12-10 00:39:57

I've been trying to figure out how to match Razor-like embedded expressions. This isn't true Razor syntax, just something similar.

Example:

Given the following string:

This @ShouldMatch1 and this @ShouldMatch2 and this
@((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2
and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this
@(ShouldMatch4))

Match and Capture:
- ShouldMatch1
- ShouldMatch2
- ShouldMatch3
- ShouldMatch4

Basically, here are the requirements:

If it starts with @ followed by [a-zA-Z]+[0-9]*, then I want to match it.
If it starts with @(, then I only want to match it if that is followed by [a-zA-Z]+[0-9]* and then a ).

Here's what I have so far. It works for the most part, but it also matches ShouldNotMatch2:

\@[(]?([a-zA-Z]+[0-9]*[)]*)
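
The failure can be reproduced with a minimal sketch. It assumes Python's `re` module as the test harness (the question does not name an engine), and the names `SAMPLE` and `ATTEMPT` are made up for illustration:

```python
import re

# The sample string from the question, joined onto one logical line.
SAMPLE = (
    "This @ShouldMatch1 and this @ShouldMatch2 and this "
    "@((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 "
    "and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this "
    "@(ShouldMatch4))"
)

# The attempted pattern: the opening paren is optional, and nothing
# ties a closing paren to it.
ATTEMPT = re.compile(r"\@[(]?([a-zA-Z]+[0-9]*[)]*)")

captures = ATTEMPT.findall(SAMPLE)
print(captures)
```

Under these assumptions the capture list includes ShouldNotMatch2. Because `[)]*` sits inside the capturing group, any trailing parentheses are captured along with the identifier.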

自由如风 2024-12-17 00:39:57

If your regex engine supports conditionals:

@(\()?([A-Za-z]+[0-9]*)(?(1)\))

Explanation:

@           # Match @
(\()?       # Optionally match `(` and capture in group 1
(           # Match and capture in group 2
 [A-Za-z]+  # 1+ ASCII letters
 [0-9]*     # 0+ ASCII digits
)           # End of capturing group
(?(1)       # If group 1 participated in the match
 \)         # match a closing parenthesis
)           # End of conditional
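
As a rough illustration, Python's `re` module is one engine that accepts the `(?(1)...)` conditional syntax. The sketch below applies the pattern to the sample string from the question; the names `SAMPLE`, `CONDITIONAL` and `find_ids` are hypothetical:

```python
import re

# The sample string from the question, joined onto one logical line.
SAMPLE = (
    "This @ShouldMatch1 and this @ShouldMatch2 and this "
    "@((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 "
    "and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this "
    "@(ShouldMatch4))"
)

# Group 1 optionally captures "(", group 2 captures the identifier, and
# the conditional demands ")" only when group 1 took part in the match.
CONDITIONAL = re.compile(r"@(\()?([A-Za-z]+[0-9]*)(?(1)\))")

def find_ids(text):
    """Return the identifier (group 2) of every match in text."""
    return [m.group(2) for m in CONDITIONAL.finditer(text)]

print(find_ids(SAMPLE))
```

The identifier is always in group 2, whether or not the parentheses were present, so the calling code does not need to check which branch matched.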


°如果伤别离去 2024-12-17 00:39:57

This code:

#!/usr/bin/env perl

$_ = <<'LA VISTA, BABY';  # the Terminator, of course :)
    This @ShouldMatch1 and this @ShouldMatch2 and this @((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this @(ShouldMatch4))'
LA VISTA, BABY

print $+{id}, "\n" while m{
    @ (?: \(  (?<id> \pL+ \d* )  \)
        |     (?<id> \pL+ \d* )
      )
}gx;

When run, it prints your desired output:

ShouldMatch1
ShouldMatch2
ShouldMatch3
ShouldMatch4

EDIT

Here is a five-stage elaboration of the previous solution from simpler to fancier. However, it is still unclear what the real rule for the identifier is, or should be.

\pL+\d* is the pattern from the original question. It says the identifier starts with letters and then might end with digits, but doesn’t have to.
\pL+\d+ makes digits mandatory.
\pL[\pL\d]* must start with a letter but then allows letters and digits to intermix.
\pL\w* adds underscore to the things that can come after the initial letter. Technically, \w according to UTS#18 is supposed to be all letters, all marks, the letter numbers (like Roman numerals), all decimal numbers, plus all connector punctuation.
\w+ allows any alphabetics, digits, or underscores throughout, without restriction. That’s what word characters are according to the standard.
(?=\pL)\w+(?<=\d) adds the constraint that it must start with a letter and end with a digit, but otherwise can be any combination of word characters.
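
To see how these variants differ in practice, here is a small sketch in Python rather than Perl. Python's built-in `re` does not support `\pL`, so ASCII `[A-Za-z]` stands in for it as an assumption. The labels, the sample identifiers and the `accepted` helper are invented for illustration:

```python
import re

# ASCII [A-Za-z] is an assumed stand-in for \pL throughout.
VARIANTS = {
    "letters, optional digits": r"[A-Za-z]+\d*",
    "letters, mandatory digits": r"[A-Za-z]+\d+",
    "letter, then letters/digits": r"[A-Za-z][A-Za-z\d]*",
    "letter, then word chars": r"[A-Za-z]\w*",
    "word chars only": r"\w+",
    "letter first, digit last": r"(?=[A-Za-z])\w+(?<=\d)",
}

# Hypothetical candidate identifiers, chosen to separate the variants.
SAMPLES = ["abc", "abc1", "a1b2", "a_b1", "1abc"]

def accepted(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]

for label, pattern in VARIANTS.items():
    print(f"{label:30} {accepted(pattern)}")
```

Each variant accepts a different subset of the samples, which is why the intended identifier rule has to be pinned down before choosing one.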

No matter which of those is actually needed — it’s quite unclear — it should be easy enough to update the code to use the appropriate variant, especially in the last two versions where the definition of what counts as these funny identifiers occurs in just one place in the code. That makes it easy to change it in just one place and gets rid of update-incoherency bugs. Programmers should always strive to factor out duplicate code, and this is true no matter their programming language, even regexes, because abstraction is fundamental to good design.
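
The same single-definition idea can be sketched in engines that lack `(?(DEFINE)...)`. The example below assumes Python's `re`, builds the pattern from one string constant, and again substitutes ASCII `[A-Za-z]` for `\pL`. `ID`, `PATTERN` and `find_ids` are hypothetical names:

```python
import re

# The identifier rule lives in exactly one place; change it here only.
ID = r"[A-Za-z]+[0-9]*"

# Python's re rejects duplicate group names, so the two branches use
# numbered groups: group 1 for "@(id)" and group 2 for a bare "@id".
PATTERN = re.compile(rf"@(?:\(({ID})\)|({ID}))")

def find_ids(text):
    """Return the identifier captured by whichever branch matched."""
    return [m.group(1) or m.group(2) for m in PATTERN.finditer(text)]

SAMPLE = (
    "This @ShouldMatch1 and this @ShouldMatch2 and this "
    "@((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 "
    "and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this "
    "@(ShouldMatch4))"
)

print(find_ids(SAMPLE))
```

Interpolating a string is a cruder mechanism than a named subpattern call, but it keeps the same property: the identifier definition is written once.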

Here then is the 5-way version:

#!/usr/bin/env perl

$_ = <<'LA VISTA, BABY';  # the Terminator, of course :)
    This @ShouldMatch1 and this @ShouldMatch2 and this @((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this @(ShouldMatch4))'
LA VISTA, BABY

$mask  = "Version %d: %s\n";
$verno = 0;

##########################################################
# Simplest version: nothing fancy
++$verno;
printf($mask, $verno, $+) while /\@(?:(\pL+\d*)|\((\pL+\d*)\))/g;
print "\n";

##########################################################
# More readable version: add /x for spacing out regex contents
++$verno;
printf($mask, $verno, $+) while / \@ (?: (\pL+\d*) | \( (\pL+\d*) \) ) /xg;
print "\n";

##########################################################
# Use vertical alignment for greatly improved legibility,
# plus named captures for convenience and self-documentation
++$verno;
printf($mask, $verno, $+{id}) while m{
    @ (?: \(  (?<id> \pL+ \d* )  \)
        |     (?<id> \pL+ \d* )
      )
}xg;
print "\n";

##########################################################
# Define the "id" pattern separately from executing it
# to avoid code duplication. Improves maintainability.
# Likely requires Perl 5.10 or better, or PCRE, or PHP.
++$verno;
printf($mask, $verno, $+)     while m{
    (?(DEFINE)  (?<id> \pL+ \d* )   )

    @ (?: \( ((?&id)) \)
        |    ((?&id))
      )
}xg;
print "\n";

##########################################################
# This time we use a named capture that is different from
# the named group used for the definition.
++$verno;
printf($mask, $verno, $+{id}) while m{
    (?(DEFINE)  (?<word> \pL+ \d* )   )

    @ (?: \( (?<id> (?&word) ) \)
        |    (?<id> (?&word) )
      )
}xg;

When run on Perl v5.10 or better, that duly produces:

Version 1: ShouldMatch1
Version 1: ShouldMatch2
Version 1: ShouldMatch3
Version 1: ShouldMatch4

Version 2: ShouldMatch1
Version 2: ShouldMatch2
Version 2: ShouldMatch3
Version 2: ShouldMatch4

Version 3: ShouldMatch1
Version 3: ShouldMatch2
Version 3: ShouldMatch3
Version 3: ShouldMatch4

Version 4: ShouldMatch1
Version 4: ShouldMatch2
Version 4: ShouldMatch3
Version 4: ShouldMatch4

Version 5: ShouldMatch1
Version 5: ShouldMatch2
Version 5: ShouldMatch3
Version 5: ShouldMatch4

It should be easy to update the definition of an id to match whatever is actually needed.

Note that some regex engines make it gratuitously cumbersome to specify properties. For example, they might require \p{L} instead of the normal \pL. That’s a Huffman‐encoding failure in their misdesign, because you always want the most commonly used form to be the shortest form. With \pL or \pN being just one character longer than \w or \d, people are much more inclined to reach for the improved versions, but things like \p{L} and \p{N} are now three characters longer than \w and \d, plus have unnecessary visual clutter to boot. You shouldn’t have to pay triple just to get the “normal” ᴀᴋᴀ most common case. :(

If you are going to put the ugly braces in, then you might as well write the thing out in full as \p{Letter} and \p{Number}. After all, “in for a penny, in for a pound,” as they say.
