C# Regex replace with multiple captures and a match at the end of the string: strange behavior?

Posted on 2024-09-13 03:06:46

I'm trying to write something that formats Brazilian phone numbers, but I want it to match from the end of the string rather than the beginning, so that it converts input strings according to the following pattern:

"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"

Since the beginning portion is what usually changes, I thought of building the match using the $ sign so it would start at the end and then capture backwards (or so I thought), replacing the match with the desired end format, and afterwards just getting rid of the parentheses "()" in front if they were empty.

This is the C# code:

s = "5135554444";
string str = Regex.Replace(s, @"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, @"^\(\) ", ""); //Get rid of empty () at the beginning

The return value was as expected for a 10 digit number. But for anything less than that, it ended up showing some strange behavior. These were my results:

"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"

It seems that it ignores the $ at the end when matching, except that if I test with fewer than 7 digits it goes like this:

"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"

Notice that the third capture group always captures from the end but keeps only the minimum number of digits its quantifier allows, while the first two groups capture from the beginning, as if the last group were non-greedy from the end and just took the minimum... is this weird, or is it me?

Now, if I change the pattern so that the third capture uses {4} instead of {1,4}, these are the results:

str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");

"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?

I know this is probably some stupidity of mine, but if I want to capture at the end of the string, wouldn't it be more reasonable for all the preceding capture groups to be captured in reverse order?

I would think that "54444" would turn into "5-4444" in this last example... but it does not.

How would one accomplish this?

(I know there may be a better way to accomplish the very same thing using a different approach... but what I'm really curious about is why this particular Regex behavior seems odd. So the answer to this question should focus on explaining why the last capture is anchored at the end of the string and why the others are not, as demonstrated in this example. I'm not particularly interested in the actual phone number formatting problem, but in understanding the Regex syntax.)

Thanks...

救星 2024-09-20 03:06:46

So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?

Use

^(\d{0,2}?)(\d{0,4})(\d{4})$

As a C# snippet, commented:

resultString = Regex.Replace(subjectString, 
  @"^             # anchor the search at the start of the string
    (\d{0,2}?)    # match as few digits as possible, maximum 2
    (\d{0,4})     # match up to four digits, as many as possible
    (\d{4})       # match exactly four digits
    $             # anchor the search at the end of the string", 
   "($1) $2-$3", RegexOptions.IgnorePatternWhitespace);

By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. you tell it to match as few characters as possible while still allowing an overall match to be found.
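
For reference, here is a minimal sketch of how the lazy pattern could be combined with the cleanup steps from the question. The class and method names (PhoneFormatter, Format) are made up for illustration and do not appear in the original code; the three regexes are the ones already shown in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneFormatter
{
    // Hypothetical helper: strips non-digits, applies the lazy pattern,
    // then removes the empty "() " prefix when no area code was captured.
    public static string Format(string input)
    {
        // Step 1: get rid of non-digits, if any.
        string digits = Regex.Replace(input, @"\D", "");

        // Step 2: the first group is lazy, so it only receives digits
        // when groups 2 and 3 cannot absorb the whole string themselves.
        string formatted = Regex.Replace(
            digits, @"^(\d{0,2}?)(\d{0,4})(\d{4})$", "($1) $2-$3");

        // Step 3: get rid of an empty "() " at the beginning.
        return Regex.Replace(formatted, @"^\(\) ", "");
    }

    public static void Main()
    {
        foreach (string s in new[] { "5135554444", "35554444", "5554444" })
        {
            Console.WriteLine($"{s} -> {Format(s)}");
        }
    }
}
```

With the three inputs from the question this should print "(51) 3555-4444", "3555-4444" and "555-4444". Because the pattern is anchored on both sides, a string with fewer than four or more than ten digits does not match at all and comes back with only the non-digits stripped.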

Without the ? in the first group, what would happen when trying to match 123456?

First, the \d{0,2} matches 12.

Then, the \d{0,4} matches 3456.

Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks: the \d{0,4} hands back one digit at a time until the \d{4} can match. After four steps, the \d{4} can match 3456, and the \d{0,4} has given up everything it had matched greedily.

Now an overall match has been found, so there is no need to try any more combinations. Therefore, the first and third groups hold the entire match (12 and 3456), and the second group is left empty.
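
The walkthrough above can be checked with a small sketch that prints what each group captured for 123456, once with the greedy first group and once with the lazy one. GroupTrace and Describe are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupTrace
{
    // Hypothetical helper: shows the three captures in brackets,
    // so that an empty group is visible as [].
    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return "no match";
        }
        return $"[{m.Groups[1].Value}] [{m.Groups[2].Value}] [{m.Groups[3].Value}]";
    }

    public static void Main()
    {
        // Greedy first group: it keeps 12, and group 2 backtracks to nothing.
        Console.WriteLine(Describe(@"^(\d{0,2})(\d{0,4})(\d{4})$", "123456"));
        // prints: [12] [] [3456]

        // Lazy first group: it starts empty, so group 2 keeps 12.
        Console.WriteLine(Describe(@"^(\d{0,2}?)(\d{0,4})(\d{4})$", "123456"));
        // prints: [] [12] [3456]
    }
}
```

The engine always works through the pattern from left to right, so the $ anchor only constrains where the match has to end; it does not change the order in which the groups claim their digits.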
