Working with mixed-encoding strings in Python 3.x

Posted on 2025-01-03 16:25:02

I'm working with a binary file that references another file by its absolute path. The path contains both Japanese and ASCII characters.

The length of the string is given, so I can just read that many bytes and convert it into a string.

However, the problem is converting the bytes. If I specify the encoding as ASCII, it fails on the Japanese characters. If I specify a Japanese encoding (Shift JIS or something similar), it doesn't read the English characters properly.

One byte is used for each ASCII character, while two bytes are used for each Japanese character.

What is the fastest and cleanest way to convert these bytes into a string? The encodings are known. Will the same technique work in older versions of Python?

Comments (1)

粉红×色少女 2025-01-10 16:25:02

It sounds like you may have fallen victim to a misunderstanding of the basics of Unicode and encodings. Maybe you have not, but such misunderstandings are common and understandable, while the situation you describe is not.

A string of bytes that contains mixed encodings is, by definition, invalid in any of those encodings. If this really were the case, you would have to split the byte string into its parts and decode every part separately. Here that would probably mean splitting on the path separators, so it would be reasonably easy, but in other cases it would not. However, I seriously doubt that this is the case, as it would mean that your source is insane. That happens, but it is unlikely. :-)
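If the parts really were encoded differently, the split-and-decode approach could look roughly like the sketch below. The function name `decode_mixed_path`, the `/` separator, and the candidate list are assumptions made for illustration, not anything the question specifies. Note that splitting on a backslash byte is not safe for Shift JIS data, because `0x5C` can also occur as the second byte of a two-byte character.

```python
def decode_mixed_path(raw, sep=b"/", encodings=("utf-8", "shift_jis")):
    """Decode each separator-delimited part with the first encoding that accepts it.

    Hypothetical helper: the separator and the candidate encodings are assumptions.
    """
    decoded_parts = []
    for part in raw.split(sep):
        for name in encodings:
            try:
                decoded_parts.append(part.decode(name))  # strict decoding
                break
            except UnicodeDecodeError:
                continue
        else:
            # No candidate encoding could decode this part.
            raise ValueError("no candidate encoding fits %r" % part)
    return sep.decode("ascii").join(decoded_parts)


# Made-up example: one component in Shift JIS, another in UTF-8.
raw = b"/".join([b"home", "データ".encode("shift_jis"), "資料".encode("utf-8")])
print(decode_mixed_path(raw))
```

The stricter encoding (UTF-8) is tried first because it rejects most byte sequences that are not actually UTF-8, which makes a false match less likely.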

If the source gives you one path as a byte string, that string most likely uses only one encoding. It may contain both Japanese and ASCII characters and still be using one encoding. The most common encodings that can handle both Japanese and ASCII are UTF-8 and UTF-16, and legacy Japanese encodings such as Shift JIS and EUC-JP cover both as well. My guess is that your source uses one of those. You write "One byte is used for each ASCII character, while two bytes are used for each Japanese character". UTF-8 normally needs three bytes for a Japanese character and UTF-16 needs two bytes even for ASCII, so that pattern fits Shift JIS or EUC-JP better than UTF-8. It seems you already tried Shift JIS, so it is worth checking which exact variant the source uses.
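One way to check a guess like this is to compare byte counts and then try strict decoding with each candidate. The following is a minimal sketch; `try_decodings`, the candidate list, and the sample path are hypothetical and only serve as illustration.

```python
def try_decodings(raw, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Return {encoding: text} for every candidate that decodes raw without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # this candidate cannot be the encoding
    return results


# Bytes needed for one Japanese character and one ASCII character.
for name in ("utf-8", "shift_jis", "utf-16-le"):
    print(name, len("日".encode(name)), len("A".encode(name)))

# Made-up sample path, encoded as Shift JIS for the demonstration.
sample = "C:/データ/file.txt".encode("shift_jis")
for name, text in try_decodings(sample).items():
    print(name, repr(text))
```

A successful decode is not proof on its own, since some encodings accept almost any byte sequence, so the decoded text still has to be checked by eye. The same `decode` method exists on byte strings in Python 2, where `str` plays the role of `bytes`.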

If not, please explain what your source is, and give examples of the byte strings (in ASCII/hex) that you are given.
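To produce such an example, the raw bytes can be printed as hex before any decoding is attempted. This is only a sketch; `hex_dump` is a hypothetical helper and the sample bytes are made up.

```python
def hex_dump(raw):
    """Format a byte string as space-separated two-digit hex values."""
    return " ".join("%02x" % b for b in bytearray(raw))


raw = b"C:\\\x83f"  # made-up bytes standing in for data read from the file
print(hex_dump(raw))
print(repr(raw))
```

Wrapping the input in `bytearray` makes the loop yield integers in both Python 2 and Python 3.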
