Best way to convert a string to bytes in Python 3?

Posted 2024-12-06 20:39:52

Comments (5)

最佳男配角 2024-12-13 20:39:52

If you look at the docs for bytes, it points you to bytearray:

bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.

The optional source parameter can be used to initialize the array in a few different ways:

If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().

If it is an integer, the array will have that size and will be initialized with null bytes.

If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.

If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.

Without an argument, an array of size 0 is created.
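As a quick illustration of the source forms listed above, here is a minimal sketch. The variable names are made up for the example; the same forms should also work with the immutable bytes constructor.

```python
# One call per form of the optional `source` argument described above.
from_text = bytearray("héllo", "utf-8")   # str: an encoding is required
from_size = bytearray(3)                  # int: that many null bytes
from_buffer = bytearray(b"abc")           # buffer-protocol object: contents are copied
from_iterable = bytearray([104, 105])     # iterable of ints in range(256)
empty = bytearray()                       # no argument: length 0

# A str without an encoding is rejected rather than guessed at.
try:
    bytearray("héllo")
    missing_encoding_rejected = False
except TypeError:
    missing_encoding_rejected = True

print(from_text, from_size, from_buffer, from_iterable, empty)
```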

So bytes can do much more than just encode a string. It's Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding), since there is no explicit verb when you use the constructor.

I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you're just skipping a level of indirection if you call encode yourself.

Also, see Serdalis' comment -- unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding) and symmetry is nice.
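To make the comparison concrete, here is a small sketch showing that the method and the constructor produce the same bytes, and that decode() reverses encode(). The sample text and variable names are only for illustration.

```python
text = "naïve café"
encoding = "utf-8"

via_method = text.encode(encoding)        # reads as "encode this string"
via_constructor = bytes(text, encoding)   # same result, no explicit verb

# decode() is the inverse of encode() when the same encoding is used.
round_trip = via_method.decode(encoding)

# The optional `errors` argument controls what happens to characters
# the target encoding cannot represent.
ascii_lossy = text.encode("ascii", errors="replace")

print(via_method == via_constructor, round_trip == text, ascii_lossy)
```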

败给现实 2024-12-13 20:39:52

It's easier than you might think:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
print(type(my_str_as_bytes)) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
print(type(my_decoded_str)) # ensure it is string representation

You can verify the result by printing the types; the output is shown below.

<class 'bytes'>
<class 'str'>
孤独陪着我 2024-12-13 20:39:52

The absolutely best way is neither of the 2, but the 3rd. The first parameter to encode defaults to 'utf-8' ever since Python 3.0. Thus the best way is

b = mystring.encode()

This will also be faster, because the default argument results not in the string "utf-8" in the C code, but NULL, which is much faster to check!

Here are some timings:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning the times were very stable after repeated runs - the deviation was just ~2 per cent.
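The timings above come from IPython's %timeit. For anyone who wants to repeat the comparison in plain Python, a rough sketch with the standard timeit module could look like this. The loop and repeat counts are arbitrary choices, and the gap may be smaller or absent on other interpreter versions and machines.

```python
import timeit

loops = 200_000  # arbitrary; raise it for steadier numbers

# Best of five runs for each spelling, as total seconds for `loops` calls.
explicit = min(timeit.repeat("'abc'.encode('utf-8')", number=loops, repeat=5))
implicit = min(timeit.repeat("'abc'.encode()", number=loops, repeat=5))

print(f"encode('utf-8'): {explicit / loops * 1e9:.0f} ns per call")
print(f"encode():        {implicit / loops * 1e9:.0f} ns per call")

# Whatever the timing, both spellings return identical bytes.
same_result = "abc".encode() == "abc".encode("utf-8")
```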


Using encode() without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
江湖彼岸 2024-12-13 20:39:52

Answer for a slightly different problem:

You have a sequence of raw unicode that was saved into a str variable:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You need to be able to get the byte literal of that unicode (for struct.unpack(), etc.)

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

Solution:

s_new: bytes = bytes(s_str, encoding="raw_unicode_escape")

Reference (scroll up for standard encodings):

Python Specific Encodings
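A short sketch of why the choice of codec matters here. The struct layout ">HHBI" is a made-up example format, not something taken from the question. For strings whose code points are all below 256, "latin-1" should give the same bytes as "raw_unicode_escape", while UTF-8 expands "\xc0" into two bytes.

```python
import struct

s_str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

raw = bytes(s_str, encoding="raw_unicode_escape")
latin = s_str.encode("latin-1")   # same bytes while every code point is < 256
utf8 = s_str.encode("utf-8")      # "\xc0" becomes the two bytes c3 80

print(len(raw), len(latin), len(utf8))

# Hypothetical layout: two big-endian 16-bit values, one byte, one 32-bit value.
fields = struct.unpack(">HHBI", raw)
print(fields)
```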

灼疼热情 2024-12-13 20:39:52

How about the Python 3 'memoryview' way?

Memoryview is a sort of mishmash of the byte/bytearray and struct modules, with several benefits.

  • Not limited to just text and bytes, handles 16 and 32 bit words too
  • Copes with endianness
  • Provides a very low overhead interface to linked C/C++ functions and data

Simplest example, for a byte array:

memoryview(b"some bytes").tolist()

[115, 111, 109, 101, 32, 98, 121, 116, 101, 115]

Or for a Unicode string (which is first converted to a bytes object):

memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).tolist()

[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]

# Another way to do the same
memoryview("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020".encode("UTF-16")).tolist()

[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]

Perhaps you need words rather than bytes?

memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).cast("H").tolist()

[65279, 117, 110, 105, 99, 111, 100, 101, 32]

memoryview(b"some  more  data").cast("I").tolist()  # "I" is a 4-byte unsigned int; "L" is 8 bytes on most 64-bit Unix builds

[1701670771, 1869422624, 538994034, 1635017060]

Word of caution. Be careful of multiple interpretations of byte order with data of more than one byte:

txt = "\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020"
for order in ("", "BE", "LE"):
    mv = memoryview(bytes(txt, f"UTF-16{order}"))
    print(mv.cast("H").tolist())

[65279, 117, 110, 105, 99, 111, 100, 101, 32]
[29952, 28160, 26880, 25344, 28416, 25600, 25856, 8192]
[117, 110, 105, 99, 111, 100, 101, 32]

Not sure if that's intentional or a bug but it caught me out!!
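This is probably not a bug. As far as I understand the memoryview documentation, cast() takes a native struct format, so "H" reads each 16-bit word in the machine's own byte order regardless of how the text was encoded. The swapped values in the second row above are what a little-endian machine produces when it reads big-endian data. A sketch of one way around it, using struct with an explicit byte-order prefix (the variable names are only illustrative):

```python
import struct
import sys

txt = "unicode "
be_bytes = txt.encode("UTF-16BE")   # big-endian, no byte order mark
le_bytes = txt.encode("UTF-16LE")   # little-endian, no byte order mark
count = len(be_bytes) // 2          # number of 16-bit words

# ">" and "<" state the byte order explicitly, so the result is the
# same on every platform.
words_from_be = list(struct.unpack(f">{count}H", be_bytes))
words_from_le = list(struct.unpack(f"<{count}H", le_bytes))
print(words_from_be)
print(words_from_le)

# memoryview.cast("H") only agrees when the data is in native order.
native_bytes = le_bytes if sys.byteorder == "little" else be_bytes
native_words = memoryview(native_bytes).cast("H").tolist()
print(native_words)
```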

The examples use UTF-16; for a full list of codecs, see the Codec registry documentation for Python 3.10.
