Convert zero-padded bytes to UTF-8 string
I'm unpacking several structs that contain 's' type fields from C. The fields contain zero-padded UTF-8 strings handled by strncpy in the C code (note this function's vestigial behaviour). If I decode the bytes I get a unicode string with lots of NUL characters on the end.
>>> b'hiya\0\0\0'.decode('utf8')
'hiya\x00\x00\x00'
I was under the impression that trailing zero bytes were part of UTF-8 and would be dropped automatically.
What's the proper way to drop the zero bytes?
Comments (4)
Either rstrip or replace will only work if the string is padded out to the end of the buffer with nulls. In practice the buffer may not have been initialised to null to begin with, so you might get something like b'hiya\0x\0'.
If you know categorically, 100%, that the C code starts with a null-initialised buffer and never re-uses it, then you might find rstrip to be simpler. Otherwise I'd go for the slightly messier but much safer approach of treating the first null as a terminator.
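The code that originally accompanied this answer did not survive on this page. As a minimal sketch of the "first null is the terminator" idea, assuming Python 3 and a hypothetical helper name decode_c_string (not from the original answer), something like this should work:

```python
def decode_c_string(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a fixed-width C field, stopping at the first NUL byte.

    Hypothetical helper for illustration; the name is not from the thread.
    """
    # partition() returns (head, separator, tail). head is everything before
    # the first NUL, or the whole input if no NUL is present.
    head, _sep, _tail = raw.partition(b"\0")
    return head.decode(encoding)


# Clean zero padding and a dirty (re-used) buffer give the same result.
print(decode_c_string(b"hiya\0\0\0"))
print(decode_c_string(b"hiya\0x\0"))
```

Cutting the bytes before decoding also means any garbage after the terminator never reaches the UTF-8 decoder, so it cannot raise a decode error.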
Use str.rstrip() to remove the trailing NULs.
Unlike the split/partition solution, this does not copy several strings and might be faster for long bytearrays.
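The snippet for this answer is also missing from the page. A rough sketch of the rstrip approach, assuming Python 3 and a hypothetical helper name strip_trailing_nuls, might look like the following. It strips on the bytes object before decoding; calling rstrip('\0') on the decoded str should behave the same way for clean padding.

```python
def strip_trailing_nuls(raw: bytes, encoding: str = "utf-8") -> str:
    """Drop trailing NUL padding, then decode.

    Hypothetical helper for illustration; the name is not from the thread.
    """
    # rstrip(b"\0") removes only the run of NUL bytes at the very end.
    return raw.rstrip(b"\0").decode(encoding)


print(repr(strip_trailing_nuls(b"hiya\0\0\0")))

# Limitation: with a dirty buffer, the embedded NUL and the stale byte
# after it are kept, because only the trailing run is stripped.
print(b"hiya\0x\0".rstrip(b"\0"))  # b'hiya\x00x'
```

This is only safe when the buffer is known to be zero-filled past the string.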
I found this to be a neat solution: