Python regular expressions - the r prefix

Posted on 2024-08-21 15:27:49

Can anyone explain why example 1 below works, when the r prefix is not used?
I thought the r prefix must be used whenever escape sequences are used.
Example 2 and example 3 demonstrate this.

# example 1
import re
print (re.sub('\s+', ' ', 'hello     there      there'))
# prints 'hello there there' - not expected as r prefix is not used

# example 2
import re
print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello     there      there'))
# prints 'hello     there' - as expected as r prefix is used

# example 3
import re
print (re.sub('(\b\w+)(\s+\1\b)+', '\1', 'hello     there      there'))
# prints 'hello     there      there' - as expected as r prefix is not used

Comments (5)

甜尕妞 2024-08-28 15:27:49

Because a backslash begins an escape sequence only when the characters after it form a valid escape sequence.

>>> '\n'
'\n'
>>> r'\n'
'\\n'
>>> print('\n')


>>> print(r'\n')
\n
>>> '\s'
'\\s'
>>> r'\s'
'\\s'
>>> print('\s')
\s
>>> print(r'\s')
\s

Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:

Escape Sequence   Meaning
\newline          Ignored
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\N{name}          Character named name in the Unicode database (Unicode only)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\uxxxx            Character with 16-bit hex value xxxx (Unicode only)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx (Unicode only)
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo
\xhh              Character with hex value hh
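
To make the table concrete, here is a small sketch (not from the docs quoted above; the helper name code_points is made up for illustration) that compares a few normal literals with their raw counterparts by looking at the code points they contain:

```python
def code_points(s):
    """Return the Unicode code points of s as a list of ints."""
    return [ord(ch) for ch in s]

# Recognized escapes collapse to a single character in a normal literal.
normal_n = code_points('\n')    # linefeed
normal_b = code_points('\b')    # backspace, not a regex word boundary

# With the r prefix the backslash stays in the string as its own character.
raw_n = code_points(r'\n')      # backslash, 'n'
raw_b = code_points(r'\b')      # backslash, 'b'

# Octal and hex escapes name a character by value; both of these are 'A'.
octal_hex = ('\101', '\x41')

print(normal_n, raw_n, normal_b, raw_b, octal_hex)
```

Only escapes from the table are used here, so the snippet should run without invalid-escape warnings on recent Python versions.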

Never rely on raw strings for path literals, as raw strings have some rather peculiar inner workings, known to have bitten people in the ass:

When an "r" or "R" prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase "n". String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.

To better illustrate this last point:

>>> r'\'
SyntaxError: EOL while scanning string literal
>>> r'\''
"\\'"
>>> '\'
SyntaxError: EOL while scanning string literal
>>> '\''
"'"
>>>
>>> r'\\'
'\\\\'
>>> '\\'
'\\'
>>> print(r'\\')
\\
>>> print(r'\')
SyntaxError: EOL while scanning string literal
>>> print('\\')
\
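
One possible workaround for the trailing-backslash restriction, as a rough sketch: keep the raw literal for the body of the path and append the final separator from a normal literal. The path C:\temp\new is a made-up example, chosen because \t and \n are both valid escapes.

```python
# Hypothetical Windows-style path, used only for illustration.
broken = 'C:\temp\new'      # normal literal: \t becomes a tab, \n a newline
base = r'C:\temp\new'       # raw literal: every backslash is kept

# A raw literal cannot end in a single backslash, so the trailing
# separator is appended from a normal literal with a doubled backslash.
with_sep = base + '\\'

print(repr(broken))
print(with_sep)
```

The normal literal silently loses both backslashes, which is the kind of surprise the warning about path literals is pointing at.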

得不到的就毁灭 2024-08-28 15:27:49

The 'r' means that what follows is a "raw string", i.e. backslash characters are treated literally instead of signifying special treatment of the following character.

http://docs.python.org/reference/lexical_analysis.html#literals

So '\n' is a single newline,
and r'\n' is two characters: a backslash and the letter 'n'.
Another way to write it would be '\\n', because the first backslash escapes the second.

An equivalent way of writing this

print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello     there      there'))

is

print (re.sub('(\\b\\w+)(\\s+\\1\\b)+', '\\1', 'hello     there      there'))

Because of the way Python treats backslash sequences that are not valid escape sequences, not all of those doubled backslashes are necessary: for example, '\s' == '\\s'. The same is not true for '\b' and '\\b', because '\b' is a backspace character. My preference is to be explicit and double all the backslashes.
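
A quick way to check that claim, as a sketch reusing the pattern and input from the question (the variable names are just for illustration):

```python
import re

text = 'hello     there      there'

raw_pattern = r'(\b\w+)(\s+\1\b)+'
doubled_pattern = '(\\b\\w+)(\\s+\\1\\b)+'

# Both literals produce exactly the same string object contents.
same_pattern = raw_pattern == doubled_pattern

raw_result = re.sub(raw_pattern, r'\1', text)
doubled_result = re.sub(doubled_pattern, '\\1', text)

# '\b' is a one-character backspace, so it differs from the two-character '\\b'.
backspace_differs = '\b' != '\\b'

print(same_pattern, repr(raw_result), repr(doubled_result), backspace_differs)
```

Since the regex engine only ever sees the final string, either spelling works; the raw form is just easier to read.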

豆芽 2024-08-28 15:27:49

Not all sequences involving backslashes are escape sequences. \t and \f are, for example, but \s is not. In a non-raw string literal, any \ that is not part of an escape sequence is seen as just another \:

>>> "\s"
'\\s'
>>> "\t"
'\t'

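\b is an escape sequence (backspace), however, and \1 is an octal escape, so in example 3 the pattern contains those literal control characters instead of a word boundary and a backreference, and it never matches. (And yes, some people consider this behaviour rather unfortunate.)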
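
Here is a sketch of what the regex engine actually receives in examples 1 and 3. One assumption to flag: \w and \s are written doubled below, which yields the same string as the question's literals but avoids the invalid-escape warnings that recent Python versions emit.

```python
import re

text = 'hello     there      there'

# Example 1: '\s' is not a recognized escape, so the backslash survives
# and the pattern is the same two-character sequence as r'\s'.
collapsed = re.sub('\\s+', ' ', text)

# Example 3: '\b' and '\1' ARE recognized escapes, so the pattern holds
# a backspace (code 8) and the character with code 1.
pattern_no_r = '(\b\\w+)(\\s+\1\b)+'
has_backspace = chr(8) in pattern_no_r
has_code_1 = chr(1) in pattern_no_r

# The text contains no backspace, so nothing matches and nothing changes.
unchanged = re.sub(pattern_no_r, '\1', text)

print(repr(collapsed), has_backspace, has_code_1, unchanged == text)
```

So example 1 works by luck of which letter follows the backslash, not because the r prefix is unnecessary in general.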

川水往事 2024-08-28 15:27:49

Try this:

>>> print('\'')
'
>>> print(r'\'')
\'
>>> print("\'")
'
>>> print(r"\'")
\'

简美 2024-08-28 15:27:49

Check the example below:

print(r"123\n123")
# outputs:
123\n123

print("123\n123")
# outputs:
123
123