Converting zero-padded bytes to a UTF-8 string

Posted 2024-10-18 06:52:57

I'm unpacking several structs from C that contain 's' type fields. The fields hold zero-padded UTF-8 strings written by strncpy (http://linux.die.net/man/3/strncpy) in the C code (note this function's vestigial behaviour). If I decode the bytes, I get a unicode string with lots of NUL characters on the end.

>>> b'hiya\0\0\0'.decode('utf8')
'hiya\x00\x00\x00'

I was under the impression that trailing zero bytes were part of UTF-8 and would be dropped automatically.

What's the proper way to drop the zero bytes?


左耳近心 2024-10-25 06:52:57

Either rstrip or replace will only work if the string is padded out to the end of the buffer with nulls. In practice the buffer may not have been initialised to null to begin with, so you might get something like b'hiya\0x\0'.

If you know for certain that the C code starts with a null-initialised buffer and never re-uses it, then you might find rstrip simpler. Otherwise I'd go for the slightly messier but much safer:

>>> b'hiya\0x\0'.split(b'\0',1)[0]
b'hiya'

which treats the first null as a terminator.
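To tie this back to the struct unpacking in the question, here is a minimal sketch of how the split-at-first-null approach might be applied to an 's' field. The record layout ("<8sH") and the helper name decode_c_string are made up for illustration and are not from the original post.

```python
import struct


def decode_c_string(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a fixed-width C string field, stopping at the first NUL."""
    # Everything after the first NUL is padding or stale buffer content.
    # If there is no NUL at all (strncpy filled the whole field),
    # split() returns the input unchanged.
    terminated = raw.split(b"\0", 1)[0]
    return terminated.decode(encoding)


# Hypothetical layout: an 8-byte name field plus a little-endian uint16.
RECORD = struct.Struct("<8sH")

packed = RECORD.pack(b"hiya\0x\0\0", 7)   # note the stale 'x' after the terminator
name_raw, value = RECORD.unpack(packed)
name = decode_c_string(name_raw)
print(repr(name), value)
```

One thing to watch for: if the C side truncated the string in the middle of a multi-byte UTF-8 sequence, decode() will raise UnicodeDecodeError, so passing errors="replace" or errors="ignore" may be appropriate depending on the data.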


滴情不沾 2024-10-25 06:52:57

Use str.rstrip() to remove the trailing NULs:

>>> 'hiya\0\0\0'.rstrip('\0')
'hiya'
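The same idea also works on the bytes object before decoding, since bytes has its own rstrip(). The sketch below compares a cleanly padded buffer with one that has a stale byte after the terminator. The sample buffers are illustrative only.

```python
clean = b"hiya\0\0\0"   # padded with NULs all the way to the end
dirty = b"hiya\0x\0"    # leftover byte after the first NUL

# rstrip() only removes NULs at the very end of the buffer.
rstrip_clean = clean.rstrip(b"\0").decode("utf-8")
rstrip_dirty = dirty.rstrip(b"\0").decode("utf-8")

print(repr(rstrip_clean))
print(repr(rstrip_dirty))
```

On the second buffer the embedded NUL and the stale 'x' survive, so rstrip is only a safe choice when the padding is known to be clean.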


叫嚣ゝ 2024-10-25 06:52:57

Unlike the split/partition solutions, this does not create several intermediate byte strings, so it might be faster for long byte arrays.

def strip_at_nul(data):
    # find() returns -1 when there is no NUL byte in the buffer
    i = data.find(b'\x00')
    if i == -1:
        return data
    return data[:i]

strip_at_nul(b'hiya\0\0\0')  # b'hiya'
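For comparison, the find-and-slice approach and the split/partition variants mentioned above should give the same result on the same input. The helper names below are invented for this sketch, and no timing claims are made here.

```python
def cut_find(data: bytes) -> bytes:
    # Slice up to the first NUL, or return the input if there is none.
    i = data.find(b"\0")
    return data if i == -1 else data[:i]


def cut_split(data: bytes) -> bytes:
    return data.split(b"\0", 1)[0]


def cut_partition(data: bytes) -> bytes:
    return data.partition(b"\0")[0]


samples = [b"hiya\0\0\0", b"hiya\0x\0", b"no-nul", b"", b"\0lead"]
results = [(cut_find(s), cut_split(s), cut_partition(s)) for s in samples]

for sample, triple in zip(samples, results):
    print(sample, triple)
```

Whether the speed difference matters in practice depends on the buffer sizes involved, so measuring with timeit on representative data is the safer way to decide.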


苍风燃霜 2024-10-25 06:52:57

I found this to be a neat solution:

''.join(chr(b) if b else '' for b in b'\0hello\0\0')
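One caveat with the join expression above: chr() maps each byte to a single code point, so it does not perform UTF-8 decoding, and it drops every NUL rather than stopping at the first one. The sketch below uses an illustrative non-ASCII sample to show the difference.

```python
# "h\u00e9llo" is "hello" with an e-acute, which takes two bytes in UTF-8.
raw = "h\u00e9llo".encode("utf-8") + b"\0\0"

# Per-byte conversion: each of the two bytes of the e-acute becomes its own character.
per_byte = "".join(chr(b) for b in raw if b)

# Cutting at the first NUL and then decoding keeps the multi-byte character intact.
decoded = raw.split(b"\0", 1)[0].decode("utf-8")

print(repr(per_byte))
print(repr(decoded))
```

For pure ASCII data the two give the same text, but for UTF-8 fields like the ones in the question, decoding the trimmed bytes is the safer route.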
