Best way to convert a string to bytes in Python 3?

Posted on 2024-12-06 20:39:52

最佳男配角 2024-12-13 20:39:52

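If you look at the docs for bytes, they point you to bytearray:

bytearray([source[, encoding[, errors]]])
Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.
The optional source parameter can be used to initialize the array in a few different ways:
If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().
If it is an integer, the array will have that size and will be initialized with null bytes.
If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.
If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.
Without an argument, an array of size 0 is created.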
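The initialization modes listed in the quoted documentation can be tried directly. The following is only a minimal sketch: the variable names are illustrative and do not come from the documentation. The last step shows what happens when a string is passed without an encoding.

```python
# Each call below exercises one form of the `source` argument.
from_text = bytearray("h\u00e9llo", "utf-8")  # str: an encoding is required
from_size = bytearray(3)                      # int: that many null bytes
from_buffer = bytearray(b"abc")               # buffer-protocol object: contents are copied
from_ints = bytearray([104, 105])             # iterable of ints in range(256)
empty = bytearray()                           # no argument: length 0

print(from_text)    # bytearray(b'h\xc3\xa9llo')
print(from_size)    # bytearray(b'\x00\x00\x00')
print(from_buffer)  # bytearray(b'abc')
print(from_ints)    # bytearray(b'hi')
print(len(empty))   # 0

# Passing a str without an encoding is rejected by the bytes constructor.
try:
    bytes("h\u00e9llo")
except TypeError as exc:
    print("TypeError:", exc)
```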

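So bytes can do much more than just encode a string. It's Pythonic that it allows you to call the constructor with any type of source parameter that makes sense.

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding), since there is no explicit verb when you use the constructor.

I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you're just skipping a level of indirection if you call encode yourself.

Also, see Serdalis' comment: unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding), and symmetry is nice.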
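The equivalence and the symmetry described above can be checked in a few lines. This is a small sketch with an arbitrary sample string, not part of the original answer. The last line also shows the optional errors argument mentioned in the quoted documentation.

```python
text = "na\u00efve caf\u00e9"  # 'naïve café'

# The constructor and the method produce identical bytes.
via_constructor = bytes(text, "utf-8")
via_method = text.encode("utf-8")
print(via_constructor == via_method)  # True

# decode() is the inverse of encode() when the same encoding is used.
print(via_method.decode("utf-8") == text)  # True

# The optional `errors` argument controls what happens to characters
# that the target encoding cannot represent.
print(text.encode("ascii", errors="replace"))  # b'na?ve caf?'
```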

败给现实 2024-12-13 20:39:52

It's easier than you might think:

my_str = "hello world"
my_str_as_bytes = my_str.encode()  # str -> bytes, UTF-8 by default
print(type(my_str_as_bytes))  # confirm it is a bytes object
my_decoded_str = my_str_as_bytes.decode()  # bytes -> str, UTF-8 by default
print(type(my_decoded_str))  # confirm it is a str object

You can verify the result by printing the types. The output is:

<class 'bytes'>
<class 'str'>

孤独陪着我 2024-12-13 20:39:52

The absolute best way is neither of the two (the bytes constructor or encode with an explicit encoding), but a third. The first parameter to encode has defaulted to 'utf-8' ever since Python 3.0. Thus the best way is

b = mystring.encode()

This will also be faster, because the default argument results in NULL rather than the string "utf-8" in the C code, and NULL is much faster to check!

Here are some timings:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning, the times were very stable across repeated runs; the deviation was only about 2 percent.

Using encode() without an argument is not Python 2 compatible, because in Python 2 the default character encoding is ASCII. There 'äöä' is a byte string, so encode() first decodes it implicitly as ASCII and fails:

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
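On Python 3 the default can be confirmed without IPython. The sketch below first checks that encode() and encode('utf-8') give the same bytes, then repeats the timing comparison with the standard timeit module. The iteration count is an arbitrary choice, and the numbers will differ by machine and Python version, so no timing output is shown.

```python
import sys
import timeit

text = "\u00e4\u00f6\u00e4"  # 'äöä'

# In Python 3, str.encode() with no argument uses UTF-8.
default_encoded = text.encode()
explicit_encoded = text.encode("utf-8")

print(default_encoded)                      # b'\xc3\xa4\xc3\xb6\xc3\xa4'
print(default_encoded == explicit_encoded)  # True
print(sys.getdefaultencoding())             # utf-8

# Rough timing of both spellings; one million calls each.
for stmt in ("'abc'.encode('utf-8')", "'abc'.encode()"):
    seconds = timeit.timeit(stmt, number=1_000_000)
    # seconds per million calls, times 1000, is nanoseconds per call
    print(f"{stmt:<24} {seconds * 1000:.0f} ns per call")
```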

江湖彼岸 2024-12-13 20:39:52

Answer for a slightly different problem:

You have a sequence of raw unicode that was saved into a str variable:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You need the equivalent bytes object for that data (for struct.unpack(), etc.):

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

Solution:

s_new: bytes = bytes(s_str, encoding="raw_unicode_escape")

Reference (scroll up for standard encodings):

Python Specific Encodings
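To see what raw_unicode_escape does with this string, and where it differs from latin-1, a short sketch may help. The struct format at the end is a hypothetical record layout chosen only because it covers all 9 bytes; the answer does not say how the data is actually structured.

```python
import struct

s_str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# For code points below 256, raw_unicode_escape and latin-1 both map each
# character to the single byte with the same value.
raw = s_str.encode("raw_unicode_escape")
latin = s_str.encode("latin-1")
print(raw == latin == b"\x00\x01\x00\xc0\x01\x00\x00\x00\x04")  # True

# They differ above U+00FF: raw_unicode_escape writes a literal \uXXXX
# escape, while latin-1 refuses to encode the character.
escaped = "\u0100".encode("raw_unicode_escape")
print(escaped)  # b'\\u0100'
try:
    "\u0100".encode("latin-1")
except UnicodeEncodeError:
    print("latin-1 cannot encode U+0100")

# Hypothetical layout (an assumption for illustration): big-endian,
# two unsigned shorts, one unsigned int, one unsigned byte.
fields = struct.unpack(">HHIB", raw)
print(fields)  # (1, 192, 16777216, 4)
```

If the string may contain characters above U+00FF, latin-1 fails loudly, whereas raw_unicode_escape silently produces escape sequences. Which of the two is preferable depends on where the data comes from.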

灼疼热情 2024-12-13 20:39:52

How about the Python 3 memoryview way?

memoryview is a sort of mishmash of bytes/bytearray and the struct module, with several benefits:

Not limited to just text and bytes, handles 16 and 32 bit words too
Copes with endianness
Provides a very low overhead interface to linked C/C++ functions and data

Simplest example, for a byte array:

memoryview(b"some bytes").tolist()

[115, 111, 109, 101, 32, 98, 121, 116, 101, 115]

Or for a unicode string (which is first converted to bytes):

memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).tolist()

[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]

# Another way to do the same
memoryview("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020".encode("UTF-16")).tolist()

[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]

Perhaps you need words rather than bytes?

memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).cast("H").tolist()

[65279, 117, 110, 105, 99, 111, 100, 101, 32]

memoryview(b"some  more  data").cast("I").tolist()  # "I" is a 4-byte unsigned int; "L" is 8 bytes on most 64-bit Unix builds

[1701670771, 1869422624, 538994034, 1635017060]

A word of caution: be careful of multiple interpretations of byte order when items are wider than one byte:

txt = "\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020"
for order in ("", "BE", "LE"):
    mv = memoryview(bytes(txt, f"UTF-16{order}"))
    print(mv.cast("H").tolist())

[65279, 117, 110, 105, 99, 111, 100, 101, 32]
[29952, 28160, 26880, 25344, 28416, 25600, 25856, 8192]
[117, 110, 105, 99, 111, 100, 101, 32]

Not sure if that's intentional or a bug, but it caught me out!

The example uses UTF-16; for a full list of codecs, see Codec registry in the Python 3.10 documentation.
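The three different results above appear to be by design rather than a bug: memoryview.cast() reads multi-byte items in the machine's native byte order, and the plain "UTF-16" codec prepends a byte order mark (U+FEFF, which is 65279). When the byte order matters, one option is to state it explicitly with the struct module. A minimal sketch, reusing the same text as the example:

```python
import struct
import sys

txt = "unicode "

be = txt.encode("UTF-16BE")  # big-endian, no byte order mark
le = txt.encode("UTF-16LE")  # little-endian, no byte order mark

# struct lets the caller state the byte order:
# ">" is big-endian, "<" is little-endian.
count = len(txt)  # one 16-bit unit per character for this text
from_be = struct.unpack(f">{count}H", be)
from_le = struct.unpack(f"<{count}H", le)
print(from_be)             # (117, 110, 105, 99, 111, 100, 101, 32)
print(from_be == from_le)  # True

# memoryview.cast("H") uses the machine's native order, so it agrees
# with only one of the two encodings on any given machine.
native = be if sys.byteorder == "big" else le
print(memoryview(native).cast("H").tolist() == list(from_be))  # True

# The plain "UTF-16" codec starts with a byte order mark.
bom = txt.encode("UTF-16")[:2]
print(bom in (b"\xff\xfe", b"\xfe\xff"))  # True
```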

About the author: 小忆控
