What exactly do the "u" and "r" string prefixes do, and what are raw string literals?

Posted on 2024-08-18 07:49:06

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.

I know what an encoding is, and I know what u'' alone does, since I understand what Unicode is.

  • But what does r'' do exactly? What kind of string does it result in?

  • And above all, what the heck does ur'' do?

  • Finally, is there any reliable way to go back from a Unicode string to a simple raw string?

  • Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?


Comments (8)

无力看清 2024-08-25 07:49:06

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
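A minimal sketch of the Windows-path caveat, written for Python 3. It uses the standard ntpath module so that it runs on any platform, and the folder and file names are made up for illustration:

```python
import ntpath  # Windows path rules, importable on every platform

# A raw literal cannot end in a single backslash, so the trailing
# separator is added from a normal literal with a doubled backslash.
folder = r'C:\temp\new' + '\\'

# Forward slashes avoid the problem; normpath converts them to backslashes.
from_slashes = ntpath.normpath('C:/temp/new/report.txt')

print(folder)        # C:\temp\new\
print(from_slashes)  # C:\temp\new\report.txt
```

Note that the raw literal also keeps \t and \n in the path as plain characters, which is the reason people reach for it in the first place.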

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
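The paragraph above describes Python 2. As a rough Python 3 counterpart (an assumption about what a reader today would run, not part of the original answer), r'...' is a text string, rb'...' is the raw byte string, and the ur'...' spelling is no longer accepted:

```python
# Python 3: the r prefix changes how the literal is read, not the type.
text_raw = r'\d+'     # str (Unicode text)
bytes_raw = rb'\d+'   # bytes, the Python 3 spelling of a raw byte string

print(type(text_raw).__name__, type(bytes_raw).__name__)  # str bytes
print(text_raw == '\\d+', bytes_raw == b'\\d+')           # True True

# The Python 2 spelling ur'...' is kept as source text and compiled,
# because writing it directly would stop this file from loading.
try:
    compile("ur'abc'", '<example>', 'eval')
    ur_accepted = True
except SyntaxError:
    ur_accepted = False

print(ur_accepted)  # False
```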

Not sure what you mean by "going back": there are no intrinsic back and forward directions, because there's no raw string type; it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6; the exact sizes depend on the build):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).

伴我心暖 2024-08-25 07:49:06

There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.

The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.

ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.

You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
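In Python 3 the same conversion is spelled with str.encode rather than str(). A small sketch of both behaviours described above, assuming ASCII as the target encoding:

```python
text = 'caf\u00e9'  # 'café', written with an escape so the source stays ASCII

# Strict conversion fails when a character has no ASCII representation.
try:
    text.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors='replace' substitutes a question mark for each such character.
lossy = text.encode('ascii', errors='replace')

print(strict_ok)  # False
print(lossy)      # b'caf?'
```

The replaced character cannot be recovered from the result, which is why the lossy route is only suitable for display or logging.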

恋你朝朝暮暮 2024-08-25 07:49:06

'raw string' means the literal is stored as it appears in the source. For example, \ is just a backslash instead of the start of an escape sequence.

朱染 2024-08-25 07:49:06

Let me explain it simply:
In Python 2, you can store strings in two different types.

The first is the str type, a byte string: each character takes 1 byte of memory (256 possible values, so it mostly stores English letters and simple symbols).

The second is the unicode type, which can store text in any language.

By default, Python 2 gives you the str type, but if you want a unicode string you can put u in front of the text, like u'text', or you can call unicode('text').

So u is roughly a shorthand for converting str to unicode. Strictly speaking, u'text' is a literal that creates the unicode object directly, while unicode('text') decodes an existing str (with the ASCII codec by default).

Now the r part: you put it in front of the text to tell Python that the text is raw, so a backslash is not an escape character. r'\n' will not create a newline character. It's just plain text containing 2 characters.

If you want a unicode string with raw text in it, use ur, because ru raises an error.

NOW, the important part:

You cannot end a raw literal with a single backslash, so r cannot store one backslash on its own.
So this code will produce an error: r'\'

To store a single backslash you need to use '\\'

If you want to store more than one backslash you can still use r: r'\\' produces 2 backslashes, as you would expect.

This is not a bug: even in a raw literal, a backslash in front of a quote stops that quote from ending the literal, so r'\' is left unterminated.

仅此而已 2024-08-25 07:49:06

A "u" prefix denotes the value has type unicode rather than str.

Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").
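A quick way to check the length claim is to look at the characters directly. This is only an illustrative sketch and runs the same way on Python 2.7 and Python 3:

```python
raw = r"\n"       # backslash followed by the letter n
newline = "\n"    # a single newline character

print(len(raw), len(newline))  # 2 1
print(list(raw))               # ['\\', 'n']
print(raw == "\\n")            # True: same value, different spelling
```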

"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.

You can have unicode raw string literals:

>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2

The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:

Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
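To illustrate the escapes named in that quote, here is a sketch for Python 3 in which the letter é (U+00E9) is written three ways while the source file itself stays pure ASCII:

```python
# Three ASCII-only spellings of the same one-character string.
by_hex = '\xe9'
by_u = '\u00e9'
by_name = '\N{LATIN SMALL LETTER E WITH ACUTE}'

print(by_hex == by_u == by_name)  # True
print(ord(by_hex))                # 233

# In a Python 3 raw literal the same text is not an escape at all.
print(len(r'\u00e9'))             # 6
```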

傾旎 2024-08-25 07:49:06

Unicode string literals

Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.

Raw string literals

If you want to create a string literal consisting only of easily typable characters like English letters or numbers, you can simply type them: 'hello world'. But if you also want to include some more exotic characters, you'll have to use a workaround.

One such workaround is escape sequences. This way you can, for example, represent a new line in your string simply by adding two easily typable characters, \n, to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!

On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:

'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'

In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
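A sketch of the path example, assuming Python 3. In Python 2 the unprefixed spelling silently turns \n into a newline; in Python 3 it is rejected outright, because \u must start a Unicode escape. The unprefixed literal is therefore kept as source text and passed to compile():

```python
raw_path = r'c:\windows\updates\new'

print(raw_path)           # c:\windows\updates\new
print('\n' in raw_path)   # False: backslash and n are separate characters
print(len(raw_path))      # 22

# The same literal without the r prefix, held as source text.
source = r"'c:\windows\updates\new'"
try:
    compile(source, '<example>', 'eval')
    accepted = True
except SyntaxError:  # Python 3: "\u" must be followed by four hex digits
    accepted = False

print(accepted)  # False on Python 3
```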

Raw string literals are not completely "raw"?

Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.

From the lexical definition of a string literal:

string     ::=  "'" stringitem* "'"
stringitem ::=  stringchar | escapeseq
stringchar ::=  <any source character except "\" or newline or the quote>
escapeseq  ::=  "\" <any source character>

It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
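The grammar can be checked against the interpreter. In this sketch is_valid_literal is a hypothetical helper (not a standard function) that reports whether a piece of source text compiles as an expression:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False


# Each candidate is kept as source text so that invalid ones can be tested.
candidates = [
    r"r'hello world'",    # ordinary raw literal
    r"r'hello\'world'",   # backslash + quote is an escapeseq, literal goes on
    r"r'hello world\'",   # closing quote is consumed by the escapeseq
    r"r'hello'world'",    # bare quote ends the literal early
]
results = [is_valid_literal(source) for source in candidates]
print(results)  # [True, True, False, False]
```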

烟花易冷人易散 2024-08-25 07:49:06

Maybe this is obvious, maybe not, but you can make the one-character backslash string by calling x=chr(92)

Python2

x=chr(92)
print type(x), len(x) # <type 'str'> 1

y='\\'
print type(y), len(y) # <type 'str'> 1

x==y   # True
x is y # False

Python3 (3.11.1)

x=chr(92)
print(type(x), len(x)) # <class 'str'> 1
# Note The Type Change To Class

y='\\'
print(type(y), len(y)) # <class 'str'> 1
# Note The Type Change To Class

x==y   # True
x is y # True
# Note this is now True, but that is a CPython implementation detail; compare strings with ==, not is
东京女 2024-08-25 07:49:06

r-strings by example (raw strings)

Just to provide a slightly more "example-oriented" pedagogy, with an eye out for the edge cases:

| Syntax | Meaning | Note |
|---|---|---|
| 'a\nb' | a, \n, b | "Regular" string with a newline |
| r'a\nb' | a, \, n, b | r string: \ does not create magic characters anymore |
| r'a\\b' | a, \, \, b | \ doesn't even escape itself, we get two \ |
| r'a\'b' | a, \, ', b | The only thing that \ "escapes" in r-strings is ' itself. But the \ still appears in the string. |
| r'a'b' | Syntax error | Unbalanced single quotes, we'd need the \ like above |
| r"a'b" | a, ', b | We can get a single quote in by using " instead |
| r'''a'"b''' | a, ', ", b | We can get both ' and " in by using triple quotes |
| 'a\'\'\'"""b' | a, ', ', ', ", ", ", b | It is impossible to have both triple ' and triple " in a raw string: https://stackoverflow.com/questions/4630465/how-to-include-a-double-quote-and-or-single-quote-character-in-a-raw-python-stri |

The easiest way to play with this yourself is to convert the input string literal to a list of characters with the list() function as mentioned at How do I split a string into a list of characters? e.g.:

>>> list(r'a\nb')
['a', '\\', 'n', 'b']

Application of r-strings: it removes the need to escape \, common in regexes

E.g. if you want to match ISO dates yyyy-mm-dd, without r, the cleanest way would be to write:

re.compile('\\d{4}-\\d{2}-\\d{2}')

because the \ has to be present in the final string seen by regexp. It would also actually work in certain Python versions if you did just:

re.compile('\d{4}-\d{2}-\d{2}')

because \d is not a valid escape sequence and gets interpreted as \ + d. But that is confusing, as it is hard to remember what is a valid escape or not (\a, \b, \f, \n, \r, \t, \v are valid, what in the name is a "Vertical Tab"???), and Python 3.12 already gives a warning if you do that:

<stdin>:1: SyntaxWarning: invalid escape sequence '\d'

see also: How to fix "SyntaxWarning: invalid escape sequence" in Python?

So with r-string we can write the simpler:

re.compile(r'\d{4}-\d{2}-\d{2}')

which is much more readable and sane.
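A short sketch confirming that the two spellings are the same pattern, and showing it in use. The sample inputs are made up, and the pattern only checks the shape of the text, not whether the date exists:

```python
import re

# Both spellings produce the same pattern string.
doubled = '\\d{4}-\\d{2}-\\d{2}'
raw = r'\d{4}-\d{2}-\d{2}'
print(doubled == raw)  # True

iso_date = re.compile(raw)
print(bool(iso_date.fullmatch('2024-08-25')))  # True
print(bool(iso_date.fullmatch('25/08/2024')))  # False
```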

The downside of r strings is that you then can't write magic characters like newline with an escape inside the literal itself. But these are not very common in regular expressions, and the re module understands \n in a pattern on its own.
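Two ways around that limitation, sketched under the assumption of Python 3: the re module interprets \n in a pattern by itself, and outside of regexes adjacent literals can mix raw and normal parts.

```python
import re

two_lines = 'a\nb'  # a, newline, b

# r'a\nb' holds a backslash and an n, but re reads \n as "newline".
print(len(r'a\nb'))                            # 4
print(bool(re.fullmatch(r'a\nb', two_lines)))  # True

# Adjacent literals are joined, so a raw part can be followed by a normal one.
mixed = r'C:\temp' '\n'
print(mixed.endswith('\n'), len(mixed))        # True 8
```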

u strings are the default in Python 3 (Unicode strings)

In Python 3, 'abc' is the same as u'abc', and the u syntax exists just to help with code backward compatibility and is never needed.

And to get a Python 2 'abc' (byte string), you have to do b'abc' in Python 3.
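A small Python 3 sketch of both statements: the u prefix changes nothing, and b gives a separate bytes type.

```python
# Python 3: the u prefix is accepted but has no effect.
print('abc' == u'abc', type(u'abc').__name__)  # True str

# Bytes need the b prefix and are a different type from str.
data = b'abc'
print(type(data).__name__)            # bytes
print(data[0])                        # 97: indexing bytes gives integers
print(data.decode('ascii') == 'abc')  # True
```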

See also: What's the u prefix in a Python string?

Unicode string vs byte string

A byte string literal can only contain ASCII characters directly (any other byte value has to be written with an escape such as \xc3), while a Unicode string can hold any Unicode character.

For example, in Python 3, if we play around with é, an 'e' with an acute accent present e.g. in French and Portuguese, and encoded as two bytes in UTF-8 0xC3 + 0xA9 we get:

>>> list('aéi')
['a', 'é', 'i']

>>> list(b'aéi')
  File "<stdin>", line 1
    list(b'aéi')
         ^^^^^^
SyntaxError: bytes can only contain ASCII literal characters

>>> list(map(lambda x: hex(x), list(bytes('aéi', 'utf8'))))
['0x61', '0xc3', '0xa9', '0x69']

so we see that:

  • 'aéi' contains three Unicode characters. Doing e.g. 'aéi'[1] gives é as intuitively expected
  • bytes('aéi', 'utf8') contains four bytes, because the é is made up of two bytes
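The same round trip can be written with encode and decode, and the non-ASCII bytes that the b'aéi' literal rejected can be spelled with \x escapes. A sketch for Python 3:

```python
text = 'a\u00e9i'               # 'aéi', spelled with an escape
encoded = text.encode('utf-8')

print(len(text), len(encoded))          # 3 4
print(encoded == b'a\xc3\xa9i')         # True: \x escapes spell the two bytes
print(encoded.decode('utf-8') == text)  # True
print(encoded[1:3])                     # b'\xc3\xa9'
```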