Backreference syntax in replacement strings (why the dollar sign?)

Posted 2024-09-02 18:33:39

In Java, and apparently in a few other languages, backreferences in the pattern are preceded by a backslash (e.g. \1, \2, \3, etc.), but in a replacement string they are preceded by a dollar sign (e.g. $1, $2, $3, and also $0).

Here's a snippet to illustrate:

System.out.println(
    "left-right".replaceAll("(.*)-(.*)", "\\2-\\1") // WRONG!!!
); // prints "2-1"

System.out.println(
    "left-right".replaceAll("(.*)-(.*)", "$2-$1")   // CORRECT!
); // prints "right-left"

System.out.println(
    "You want million dollar?!?".replaceAll("(\\w*) dollar", "US\\$ $1")
); // prints "You want US$ million?!?"

System.out.println(
    "You want million dollar?!?".replaceAll("(\\w*) dollar", "US$ \\1")
); // throws IllegalArgumentException: Illegal group reference

Questions:

  • Is the use of $ for backreferences in replacement strings unique to Java? If not, what language started it? What flavors use it and what don't?
  • Why is this a good idea? Why not stick to the same pattern syntax? Wouldn't that lead to a more cohesive and an easier to learn language?
    • Wouldn't the syntax be more streamlined if statements 1 and 4 in the above were the "correct" ones instead of 2 and 3?


Comments (2)

摘星┃星的人 2024-09-09 18:33:39

Is the use of $ for backreferences in replacement strings unique to Java?

No. Perl uses it, and Perl certainly predates Java's Pattern class. Java's regex support is explicitly described in terms of Perl regexes.

For example: http://perldoc.perl.org/perlrequick.html#Search-and-replace

Why is this a good idea?

Well obviously you don't think it is a good idea! But one reason that it is a good idea is to make Java search/replace support (more) compatible with Perl's.

There is another possible reason why $ might have been viewed as a better choice than \: a \ has to be written as \\ in a Java String literal.
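To make the escaping cost concrete, here is a minimal sketch. The class and method names (EscapeDemo, collapseDoubles, literalReplace) are made up for illustration. It shows the doubled backslash needed for a pattern backreference in a Java string literal, next to the $1 form in the replacement, and uses Matcher.quoteReplacement for the case where the replacement text should be taken literally.

```java
import java.util.regex.Matcher;

class EscapeDemo {
    // The regex is (\w)\1, but in a Java string literal every backslash is doubled.
    // The replacement uses $1, which needs no extra escaping in the literal.
    static String collapseDoubles(String s) {
        return s.replaceAll("(\\w)\\1", "$1");
    }

    // When the replacement should be inserted verbatim, quoteReplacement
    // escapes any $ and \ so that replaceAll does not interpret them.
    static String literalReplace(String s, String regex, String literal) {
        return s.replaceAll(regex, Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(collapseDoubles("aabbc"));                    // prints "abc"
        System.out.println(literalReplace("cost: X", "X", "$5\\each")); // prints "cost: $5\each"
    }
}
```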

But all of this is pure speculation. None of us were in the room when the design decisions were made. And ultimately it doesn't really matter why they designed the replacement String syntax that way. The decisions have been made and set in concrete, and any further discussion is purely academic ... unless you just happen to be designing a new language or a new regex library for Java.

冬天旳寂寞 2024-09-09 18:33:39

After doing some research, I understand the issue now: Perl had to use different symbols for pattern backreferences and replacement backreferences, and while java.util.regex.* doesn't have to follow suit, it chooses to, for traditional rather than technical reasons.


On the Perl side

(Please keep in mind that all I know about Perl at this point comes from reading Wikipedia articles, so feel free to correct any mistakes I may have made)

The reason why it had to be done this way in Perl is the following:

  • Perl uses $ as a sigil (i.e. a symbol attached to variable name).
  • Perl string literals are variable interpolated.
  • Perl regex actually captures groups as variables $1, $2, etc.

Thus, because of the way Perl is interpreted and how its regex engine works, backreferences in the pattern must be preceded by a backslash (e.g. \1), because if the sigil $ were used instead (e.g. $1), it would cause unintended variable interpolation into the pattern.

The replacement string, due to how it works in Perl, is evaluated within the context of every match. It is most natural for Perl to use variable interpolation here, so the regex engine captures groups into variables $1, $2, etc, to make this work seamlessly with the rest of the language.


On the Java side

Java is a very different language from Perl, and most importantly here, it has no variable interpolation. Moreover, replaceAll is a method call, and as with all method calls in Java, arguments are evaluated once, before the method is invoked.

Thus, a variable interpolation feature by itself would not be enough, since in essence the replacement string must be re-evaluated on every match, and that's just not the semantics of method calls in Java. A variable-interpolated replacement string that is evaluated before replaceAll is even invoked is practically useless; the interpolation needs to happen inside the method, on every match.
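The difference between "evaluated once before the call" and "evaluated on every match" can be sketched as follows. This assumes Java 9 or later, where Matcher.replaceAll accepts a Function<MatchResult, String>; the class and method names are hypothetical. In eager, toUpperCase runs on the two-character text "$1" before replaceAll is invoked, so it has no effect on the captured text. In perMatch, the function is called once per match and sees the actual group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PerMatchDemo {
    // The replacement argument is fully evaluated before replaceAll runs:
    // "$1".toUpperCase() is still "$1", so the captured text is not upper-cased.
    static String eager(String input) {
        String replacement = "<" + "$1".toUpperCase() + ">";
        return input.replaceAll("(\\w+)", replacement);
    }

    // The function is invoked once per match, so it can work on the captured text.
    // quoteReplacement keeps any $ or \ in the result from being interpreted again.
    static String perMatch(String input) {
        return Pattern.compile("(\\w+)").matcher(input)
                .replaceAll(mr -> Matcher.quoteReplacement("<" + mr.group(1).toUpperCase() + ">"));
    }

    public static void main(String[] args) {
        System.out.println(eager("left-right"));    // prints "<left>-<right>"
        System.out.println(perMatch("left-right")); // prints "<LEFT>-<RIGHT>"
    }
}
```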

Since that is not the semantics of the Java language, replaceAll must do this "just-in-time" interpolation manually. As such, there is absolutely no technical reason why $ is the escape symbol for backreferences in replacement strings. It could very well have been \. Conversely, backreferences in the pattern could also have been escaped with $ instead of \, and it would still have worked just as well technically.
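As a rough illustration that the choice of symbol is arbitrary, here is a hypothetical replaceAll that interprets \1 through \9 in the replacement string instead of $1. It is only a sketch under simplifying assumptions (single-digit group numbers, no named groups, a backslash before any other character yields that character literally) and is not how java.util.regex is implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashReplacer {
    // Hypothetical variant of replaceAll where \N in the replacement refers to group N.
    static String replaceAll(String input, String regex, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // text between matches is copied unchanged
            // Expand the replacement for this particular match.
            for (int i = 0; i < replacement.length(); i++) {
                char c = replacement.charAt(i);
                if (c == '\\' && i + 1 < replacement.length()) {
                    char next = replacement.charAt(++i);
                    if (Character.isDigit(next)) {
                        String group = m.group(next - '0');
                        if (group != null) out.append(group); // unmatched group adds nothing
                    } else {
                        out.append(next); // escaped literal, e.g. \\ for a backslash
                    }
                } else {
                    out.append(c); // $ is just an ordinary character here
                }
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("left-right", "(.*)-(.*)", "\\2-\\1"));
        // prints "right-left"
        System.out.println(replaceAll("You want million dollar?!?", "(\\w*) dollar", "US$ \\1"));
        // prints "You want US$ million?!?"
    }
}
```

With this variant, statements 1 and 4 from the question behave the way the asker wished, which suggests the $ convention is a matter of precedent rather than necessity.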

The reason Java does regex the way it does is purely traditional: it's simply following the precedent set by Perl.
