How do extended regular expressions interpret \\n and \\\n?

Posted 2024-11-28 03:32:24

Within an ERE, a backslash character (\, \a, \b, \f, \n,
\r, \t, \v) is considered to begin an escape sequence.

Then I see \\n and [\\\n]. I can guess that both \\n and [\\\n] here mean a \ followed by a newline, but I'm confused about the exact process for interpreting such sequences: how many \s are actually required?

UPDATE

I don't have a problem understanding regexes in programming languages, so please keep the context within the lexer.

[root@ ]# echo "test\
> hi"

嘦怹 2024-12-05 03:32:24

In practice, a regex written as a string literal is processed by two compilers: first the programming language's compiler, then the regex compiler:

    Original  Compiled    Regex compiled
    "\n"      NL          NL
    "\\n"     '\'+'n'     NL
    "\\\n"    '\'+NL      NL
    "\\\\n"   '\'+'\'+'n' '\'+'n'

The first three forms all end up matching a newline, so you can simply use the shortest one, "\n".

Code examples:

JavaScript:

    'a\nb'.replace(RegExp("\n"),'<br>')
    'a\nb'.replace(RegExp("\\n"),'<br>')
    'a\nb'.replace(RegExp("\\\n"),'<br>')

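but not the following, because a regex literal is processed only once, by the regex compiler, so /\\\n/ means a literal backslash followed by a newline: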

    'a\nb'.replace(/\\\n/,'<br>')

Java:

    System.out.println("a\nb".replaceAll("\n","<br>"));
    System.out.println("a\nb".replaceAll("\\n","<br>"));
    System.out.println("a\nb".replaceAll("\\\n","<br>"));

Python:

    import re  # standard-library regex module
    '<br>'.join(re.split('\n', 'a\nb'))
    '<br>'.join(re.split('\\n', 'a\nb'))
    '<br>'.join(re.split('\\\n', 'a\nb'))
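The table above can be checked with a minimal Python sketch, assuming Python's `re` module plays the role of the regex compiler. The `describe` helper and the row labels are illustrative names, not part of the original answer. For each row it lists the characters left after string-literal processing and whether the resulting regex matches a single newline.

```python
import re

# Each pattern is written as an ordinary (non-raw) string literal,
# mirroring the rows of the table above.
patterns = {
    "one backslash": "\n",         # NL
    "two backslashes": "\\n",      # '\' + 'n'
    "three backslashes": "\\\n",   # '\' + NL
    "four backslashes": "\\\\n",   # '\' + '\' + 'n'
}


def describe(pattern):
    """Return (code points after string processing, matches a newline?)."""
    code_points = [hex(ord(ch)) for ch in pattern]
    matches_newline = re.fullmatch(pattern, "\n") is not None
    return code_points, matches_newline


for label, pattern in patterns.items():
    print(label, *describe(pattern))
```

Only the four-backslash form fails to match a newline. It matches a literal backslash followed by the letter n instead.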

郁金香雨 2024-12-05 03:32:24

This depends on the programming language and on its string-handling options.

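For example, in Java strings, if you need a literal backslash in a string, you need to double it. So the regex \n must be written as "\\n". If you want to match a backslash using a regex, you need to escape it twice: once for Java's string handler and once for the regex engine. So, to match \, the regex is \\, and the corresponding Java string is "\\\\".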

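Many programming languages have special "verbatim" or "raw" strings in which you don't need to escape backslashes. So the regex \n can be written as the normal Python string "\\n" or as the Python raw string r"\n". The Python string "\n" is the actual newline character.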
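A small Python sketch of the difference between normal and raw strings, assuming the standard `re` module; the variable names are illustrative only:

```python
import re

normal = "\\n"    # normal string: the backslash is doubled
raw = r"\n"       # raw string: the same two characters, backslash and n
newline = "\n"    # normal string: one actual newline character

print(normal == raw, len(raw), len(newline))

# The two-character regex \n still matches a real newline, because the
# regex engine interprets the escape itself.
print(re.fullmatch(raw, "\n") is not None)

# Matching one literal backslash: four backslashes in a normal string,
# two in a raw string.
backslash_normal = "\\\\"
backslash_raw = r"\\"
print(backslash_normal == backslash_raw)
print(re.fullmatch(backslash_raw, "\\") is not None)
```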

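This can become confusing, because sometimes not escaping the backslash happens to work. For example, the Python string "\d\n" happens to work as a regex that's intended to match a digit followed by a newline. This is because \d isn't a recognized character escape sequence in Python strings, so it's kept as a literal \d and fed that way to the regex engine (recent Python versions emit a warning for such unrecognized escapes). The \n is translated to an actual newline, but that happens to match the newline in the string that the regex is tested against.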

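However, if you forget to escape a backslash where the resulting sequence is a valid character escape sequence, bad things happen. For example, the regex \bfoo\b matches the whole word foo (but it doesn't match the foo in foobar). If you write the regex string as "\bfoo\b", the \bs are translated into backspace characters by the string processor, so the regex engine is told to match <backspace>foo<backspace>, which obviously will fail.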
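The \bfoo\b pitfall can be reproduced with a short Python sketch, assuming the standard `re` module; `text`, `wrong` and `right` are illustrative names:

```python
import re

text = "foo foobar"

wrong = "\bfoo\b"    # the string processor turns each \b into a backspace (0x08)
right = r"\bfoo\b"   # raw string: the regex engine sees \b as a word boundary

print([hex(ord(ch)) for ch in wrong])  # the pattern starts and ends with 0x8
print(re.findall(wrong, text))         # [] since the text contains no backspaces
print(re.findall(right, text))         # ['foo'], the foo inside foobar is skipped
```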

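Solution: always use verbatim strings where you have them (e.g. Python's r"...", .NET's @"..."), or use regex literals where you have them (e.g. JavaScript's and Ruby's /.../). Or use RegexBuddy to automatically translate the regex into your language's special format.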

To get back to your examples:

\\n as a regex means "match a backslash, followed by n".
[\\\n] as a regex means "match either a backslash or a newline".
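Both readings can be checked with a brief Python sketch. It assumes Python's `re` module as the regex engine (other engines, including lex, may treat escapes inside brackets differently) and uses raw strings so that the regex engine receives the patterns exactly as written above. The variable names are illustrative.

```python
import re

backslash_then_n = re.compile(r"\\n")         # regex: \\n
backslash_or_newline = re.compile(r"[\\\n]")  # regex: [\\\n]

# \\n matches the two characters backslash + n, not a newline.
print(backslash_then_n.fullmatch("\\n") is not None)
print(backslash_then_n.fullmatch("\n") is not None)

# [\\\n] matches exactly one character: a backslash or a newline.
print(backslash_or_newline.fullmatch("\\") is not None)
print(backslash_or_newline.fullmatch("\n") is not None)
print(backslash_or_newline.fullmatch("n") is not None)
```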
