Strange backslash substitution in Ruby

Posted 2024-08-07 00:26:19 · 761 words · 13 views · 0 comments

I don't understand the behavior of this Ruby code:

>> puts '\\ <- single backslash'
# \ <- single backslash

>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced

So far, everything is as expected. But if we search for one backslash with /\\/ and replace it with two, encoded as '\\\\', why do we get this:

>> puts '\\ <- only 1 backslash, even though we replace 1 with 2'.sub(/\\/, '\\\\')
# \ <- only 1 backslash, even though we replace 1 with 2

And then, when we encode three backslashes with '\\\\\\', we only get two:

>> puts '\\ <- 2 backslashes, even though we replace 1 with 3'.sub(/\\/, '\\\\\\')
# \\ <- 2 backslashes, even though we replace 1 with 3

Can anyone explain why a backslash gets swallowed in the replacement string? This happens on both 1.8 and 1.9.

Comments (5)

北凤男飞 2024-08-14 00:26:19

Quick Answer

If you want to sidestep all this confusion, use the much less confusing block syntax. Here is an example that replaces each backslash with 2 backslashes:

"some\\path".gsub('\\') { '\\\\' }

Gruesome Details

The problem is that when sub (or gsub) is called without a block, Ruby interprets special character sequences in the replacement argument. Unfortunately, sub uses the backslash as the escape character for these:

\& (the entire match)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
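
As a rough illustration of the sequences listed above, the sketch below runs each one against a made-up sample string (the text 'hello world' and the variable names are only for this example). Most replacements are written as single-quoted literals so that a single backslash in the source reaches sub as a single backslash.

```ruby
text = 'hello world'

# The pattern /o w/ matches "o w"; "hell" precedes it and "orld" follows it.
whole_match = text.sub(/o w/, '[\&]')       # \& -> the matched text
zero_group  = text.sub(/o w/, '[\0]')       # \0 -> same as \&
pre_match   = text.sub(/o w/, '[\`]')       # \` -> text before the match
post_match  = text.sub(/o w/, "[\\']")      # \' -> text after the match (double-quoted literal)
first_group = text.sub(/(o) (w)/, '[\1]')   # \1 -> first capture group
last_group  = text.sub(/(o) (w)/, '[\+]')   # \+ -> last capture group that matched
backslash   = text.sub(/o w/, '[\\\\]')     # \\ -> one literal backslash

puts whole_match, zero_group, pre_match, post_match,
     first_group, last_group, backslash
```

The last line of output contains a single backslash between the brackets, which is the case the original question ran into.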

Like any escaping, this creates an obvious problem. If you want to include the literal value of one of the above sequences (e.g. \1) in the output string, you have to escape it. So, to get Hello \1, the replacement string needs to be Hello \\1. And to represent this as a string literal in Ruby, you have to escape those backslashes again, like this: "Hello \\\\1"

So, there are two different escaping passes. The first one takes the string literal and creates the internal string value. The second takes that internal string value and replaces the sequences above with the matching data.
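
The two passes can be observed separately in a small sketch. The subject string 'x' and the variable names here are invented for the example; only the "Hello \\\\1" literal comes from the text above.

```ruby
literal_form = "Hello \\\\1"   # as typed in the source file

# Pass 1 (the Ruby parser): each \\ in the literal becomes one backslash,
# so the string value is the 9 characters  Hello \\1
puts literal_form.length

# Pass 2 (String#sub): the remaining \\ collapses to one backslash and the
# 1 is left alone, so the output is  Hello \1
escaped_result = 'x'.sub(/(x)/, literal_form)
puts escaped_result

# With only two backslashes in the literal, sub sees \1 and inserts group 1.
group_result = 'x'.sub(/(x)/, "Hello \\1")
puts group_result
```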

If a backslash is not followed by a character that matches one of the above sequences, then the backslash (and the character that follows) passes through unaltered. This also affects a backslash at the end of the string: it passes through unaltered. It's easiest to see this logic in the Rubinius code; just look for the to_sub_replacement method in the String class.

Here are some examples of how String#sub parses the replacement string:

  • 1 backslash \ (which has a string literal of "\\")

    Passes through unaltered because the backslash is at the end of the string and has no characters after it.

    Result: \

  • 2 backslashes \\ (which have a string literal of "\\\\")

    The pair of backslashes matches the escaped backslash sequence (see \\ above) and gets converted into a single backslash.

    Result: \

  • 3 backslashes \\\ (which have a string literal of "\\\\\\")

    The first two backslashes match the \\ sequence and get converted to a single backslash. Then the final backslash is at the end of the string so it passes through unaltered.

    Result: \\

  • 4 backslashes \\\\ (which have a string literal of "\\\\\\\\")

    Two pairs of backslashes each match the \\ sequence and get converted to a single backslash.

    Result: \\

  • 2 backslashes with a character in the middle \a\ (which has a string literal of "\\a\\")

    The \a does not match any of the escape sequences so it is allowed to pass through unaltered. The trailing backslash is also allowed through.

    Result: \a\

    Note: The same result could be obtained from: \\a\\ (with the literal string: "\\\\a\\\\")
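
The cases in the list above can be checked mechanically. This is a small sketch: the helper name substituted and the throwaway subject string 'x' are made up for the example and are not part of Ruby or of the original answer.

```ruby
# Replace the whole one-character subject so that the return value is exactly
# what String#sub produced from the replacement string.
def substituted(replacement)
  'x'.sub(/x/, replacement)
end

# The replacement strings from the list, written as double-quoted literals.
cases = ["\\", "\\\\", "\\\\\\", "\\\\\\\\", "\\a\\", "\\\\a\\\\"]
results = cases.map { |replacement| substituted(replacement) }

cases.zip(results).each do |replacement, result|
  puts "#{replacement.ljust(6)} -> #{result}"
end
```

Each printed line shows the replacement string as sub receives it on the left and the text it ends up inserting on the right.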

In hindsight, this could have been less confusing if String#sub had used a different escape character. Then there wouldn't be the need to double escape all the backslashes.

掩于岁月 2024-08-14 00:26:19

This is an issue because the backslash (\) serves as an escape character for both Regexps and Strings. You could use the special sequence \& to reduce the number of backslashes in the gsub replacement string.

foo.gsub(/\\/, '\&\&\&') # for some string foo, replace each \ with \\\

EDIT: I should mention that the value of \& is from a Regexp match, in this case a single backslash.
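
For a concrete run of the \& approach, here is a minimal sketch with an assumed value for foo (the string 'a\b' is made up for the example):

```ruby
foo = 'a\b'   # three characters: a, backslash, b

# Each \& in the replacement expands to the matched text, which is the single
# backslash, so one backslash becomes three.
tripled = foo.gsub(/\\/, '\&\&\&')

puts tripled          # a, three backslashes, b
puts tripled.length
```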

Also, I thought there was a special way to create a string that disables the escape character, but apparently not. None of these will produce two backslashes:

puts "\\"
puts '\\'
puts %q{\\}
puts %Q{\\}
puts """\\"""
puts '''\\'''
puts <<EOF
\\
EOF

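One form that is not in the list above is a heredoc whose identifier is wrapped in single quotes. Ruby's documentation describes that form as disabling both interpolation and escaping, so it should keep both backslashes. A minimal sketch (the variable name raw is just for illustration):

```ruby
# Single-quoted heredoc identifier: the body is taken literally.
raw = <<'EOF'
\\
EOF

puts raw                 # prints two backslashes
puts raw.chomp.length    # number of characters before the trailing newline
```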
拥抱我好吗 2024-08-14 00:26:19

Argh, right after I typed all this out, I realised that \ is used to refer to groups in the replacement string. I guess this means you need a literal \\ in the replacement string to get one \ in the output. To get a literal \\ you need four \s in the source, so to replace one with two you actually need eight(!).

# Double every occurrence of \. There are eight backslashes on the right there!
>> puts '\\'.sub(/\\/, '\\\\\\\\')

Anything I'm missing? Any more efficient ways?
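
As a side-by-side check, the sketch below compares the eight-backslash replacement with two shorter spellings that appear elsewhere on this page, the block form and the \& sequence. The variable names are invented for the example.

```ruby
subject = '\\'   # one backslash

# Eight backslashes in the source -> four in the string -> two in the output.
eight_backslashes = subject.sub(/\\/, '\\\\\\\\')

# A block's return value is inserted as-is, with no second escaping pass.
block_form = subject.sub(/\\/) { '\\\\' }

# \& stands for the matched text (one backslash), used twice.
match_reference = subject.sub(/\\/, '\&\&')

puts eight_backslashes, block_form, match_reference
```

All three lines of output should be identical: two backslashes.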

何处潇湘 2024-08-14 00:26:19

Clearing up a little confusion on the author's second line of code.

You said:

>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced

Two backslashes aren't getting replaced here. You're replacing one escaped backslash with two a's ('aa'). That is, if you had used .sub(/\\/, 'a'), you would see only one 'a'.

'\\'.sub(/\\/, 'anything') #=> anything

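A quick way to confirm this is to look at the length of the subject string. This is only a sketch, with variable names made up for the example:

```ruby
one_backslash = '\\'            # the literal has two characters, the string has one
puts one_backslash.length

single_a = one_backslash.sub(/\\/, 'a')    # the one backslash becomes one a
double_a = one_backslash.sub(/\\/, 'aa')   # the same backslash becomes two a's
puts single_a, double_a
```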
昔梦 2024-08-14 00:26:19

The Pickaxe book mentions this exact problem, actually. Here's another alternative (from page 130 of the latest edition):

str = 'a\b\c'               # => "a\b\c"
str.gsub(/\\/) { '\\\\' }   # => "a\\b\\c"