If you look at the docs for bytes, it points you to bytearray:
bytearray([source[, encoding[, errors]]])
Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.
The optional source parameter can be used to initialize the array in a few different ways:
If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().
If it is an integer, the array will have that size and will be initialized with null bytes.
If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.
If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.
Without an argument, an array of size 0 is created.
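As a quick illustration of the source forms listed above, here is a minimal sketch using `bytes` (which accepts the same kinds of source as `bytearray`). The variable names and sample values are made up for the example:

```python
# Each call mirrors one of the documented source forms.
from_str = bytes("héllo", "utf-8")       # string: an encoding is required
from_int = bytes(3)                      # integer: that many null bytes
from_buffer = bytes(bytearray(b"abc"))   # object supporting the buffer interface
from_iterable = bytes([104, 105])        # iterable of ints in range(256)
empty = bytes()                          # no argument: zero-length result

print(from_str)       # b'h\xc3\xa9llo'
print(from_int)       # b'\x00\x00\x00'
print(from_buffer)    # b'abc'
print(from_iterable)  # b'hi'
print(empty)          # b''
```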
So bytes can do much more than just encode a string. It's Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.
For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting -- "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding) -- there is no explicit verb when you use the constructor.
I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you're just skipping a level of indirection if you call encode yourself.
Also, see Serdalis' comment -- unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding) and symmetry is nice.
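To make the comparison concrete, a small sketch (the sample string and variable names are illustrative only) showing that both spellings give the same bytes and that `decode` is the inverse of `encode`:

```python
text = "naïve café"  # sample value, chosen only for illustration

via_method = text.encode("utf-8")        # explicit verb: "encode this string"
via_constructor = bytes(text, "utf-8")   # same result through the constructor

print(via_method == via_constructor)     # True

# decode() is the inverse of encode(), which gives the symmetry mentioned above.
round_trip = via_method.decode("utf-8")
print(round_trip == text)                # True
```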
The absolute best way is neither of the two, but a third. The first parameter to encode has defaulted to 'utf-8' ever since Python 3.0. Thus the best way is
b = mystring.encode()
This will also be faster, because the default argument results not in the string "utf-8" in the C code, but NULL, which is much faster to check!
Here are some timings:
In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop
In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop
Despite the warning the times were very stable after repeated runs - the deviation was just ~2 per cent.
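To repeat the comparison outside IPython, a rough sketch with the standard `timeit` module could look like this. The repeat and loop counts are arbitrary choices, and the absolute numbers will differ by machine and Python version:

```python
import timeit

LOOPS = 100_000  # arbitrary loop count for this sketch

# timeit.repeat returns one total time (in seconds) per repetition.
explicit = timeit.repeat("'abc'.encode('utf-8')", repeat=5, number=LOOPS)
default = timeit.repeat("'abc'.encode()", repeat=5, number=LOOPS)

# Report the best run, converted to nanoseconds per call.
print(f"explicit 'utf-8': {min(explicit) / LOOPS * 1e9:.0f} ns per call")
print(f"default argument: {min(default) / LOOPS * 1e9:.0f} ns per call")
```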
Using encode() without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.
>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
It's easier than you might think:
You can verify the result by printing the types. Refer to the output below.
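A minimal sketch of such a check, assuming a sample string `my_str` and the default UTF-8 encoding (both are illustrative choices, not taken from the answer):

```python
my_str = "hello world"       # hypothetical sample value
my_bytes = my_str.encode()   # UTF-8 by default in Python 3

print(type(my_str))    # <class 'str'>
print(type(my_bytes))  # <class 'bytes'>
```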
Answer for a slightly different problem:
You have a sequence of raw unicode that was saved into a str variable:
You need to be able to get the byte literal of that unicode (for struct.unpack(), etc.)
Solution:
Reference (scroll up for standard encodings):
Python Specific Encodings
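A sketch of how this could look, assuming the codec in question is `raw_unicode_escape` (one of the Python-specific encodings; the sample data, the variable names, and the `struct` format string are all made up for illustration):

```python
import struct

# Assumed sample: raw byte values that ended up in a str, one code point per byte.
s_str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# raw_unicode_escape maps code points below 256 straight to single bytes.
s_bytes = s_str.encode("raw_unicode_escape")
print(s_bytes)  # b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

# The result can be handed to struct.unpack(); this layout is hypothetical:
# four big-endian unsigned shorts followed by one unsigned byte.
fields = struct.unpack(">HHHHB", s_bytes)
print(fields)  # (1, 192, 256, 0, 4)
```

Note that this only gives one byte per character for code points below 256; anything higher is written out as a `\uXXXX` escape sequence by this codec.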
How about the Python 3 'memoryview' way?
Memoryview is a sort of mishmash of the byte/bytearray and struct modules, with several benefits.
Simplest example, for a byte array:
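A minimal sketch of the byte-array case (the sample data is an arbitrary choice):

```python
data = b"some bytes"        # arbitrary sample data
view = memoryview(data)     # no copy is made; the view shares the buffer

# tolist() yields the underlying bytes as a list of integers.
print(view.tolist())  # [115, 111, 109, 101, 32, 98, 121, 116, 101, 115]
```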
Or for a Unicode string (which is first converted to bytes):
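A sketch of the string case, assuming UTF-16 as the encoding and an arbitrary sample string. The plain `UTF-16` codec prepends a byte order mark and uses the machine's native byte order, so the exact byte sequence depends on the platform:

```python
text = "unicode "                 # arbitrary sample string
encoded = bytes(text, "UTF-16")   # BOM followed by two bytes per character

byte_values = memoryview(encoded).tolist()
print(byte_values)
# On a little-endian machine this starts with the BOM bytes 255, 254,
# followed by 117, 0, 110, 0, ... for 'u', 'n', ...
```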
Perhaps you need words rather than bytes?
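One way to get 16-bit words instead is `memoryview.cast`. This sketch assumes the same UTF-16 sample as before and the `'H'` (unsigned short, native byte order) format:

```python
encoded = bytes("unicode ", "UTF-16")    # arbitrary sample, BOM included

# cast('H') reinterprets the buffer as native-order unsigned 16-bit integers;
# the buffer length must be a multiple of the item size (2 bytes here).
words = memoryview(encoded).cast("H").tolist()
print(words)  # [65279, 117, 110, 105, 99, 111, 100, 101, 32]
```

The first word, 65279 (0xFEFF), is the byte order mark; the rest are the code points of the characters.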
Word of caution. Be careful of multiple interpretations of byte order with data of more than one byte:
Not sure if that's intentional or a bug but it caught me out!!
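One way such a discrepancy can show up, sketched under the assumption that the data is UTF-16 text read back as native-order words: the explicit `UTF-16-BE` and `UTF-16-LE` codecs fix the byte order and omit the BOM, while `cast('H')` always reads in the machine's native order.

```python
text = "unicode "  # arbitrary sample string

results = {}
for codec in ("UTF-16", "UTF-16-BE", "UTF-16-LE"):
    # Encode with the given byte order, then read back as native-order words.
    results[codec] = memoryview(text.encode(codec)).cast("H").tolist()
    print(f"{codec:10} {results[codec]}")

# Only the variant whose byte order matches the machine yields the plain code
# points; the other one comes out byte-swapped (e.g. 117 becomes 29952).
```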
The examples use UTF-16; for a full list of codecs, see the codec registry in the Python 3.10 documentation.