How do I convert UTF-8 to MUTF-8 in PHP?

Posted on 2025-01-25 01:39:55

In PHP, how can I convert UTF-8 to MUTF-8? I am hoping I can lazily just get away with

function utf8_to_mutf8(string $utf8):string{
    return str_replace("\x00", "\xC0\x80", $utf8);
}

? Given that every byte of a multi-byte character in UTF-8 has the high bit set, \x00 will never occur inside any multi-byte character, and the following should be completely unnecessary?

function utf8_to_mutf8(string $utf8):string{
    $old = mb_internal_encoding();
    mb_internal_encoding("UTF-8");
    $ret = mb_ereg_replace("\x00", "\xC0\x80",$utf8);
    mb_internal_encoding($old);
    return $ret;
}

Comments (1)

手心的海 2025-02-01 01:39:55
  • Yes, "\x00" occurs only for code point U+0000 and never as part of any other code point. Only ASCII characters have the highest bit unset (U+0000 to U+007F = bits 00000000 to 01111111); every byte of a multi-byte sequence has it set. Bytes with the highest bit unset can therefore also be used for synchronization when it is unclear where the next code point/character begins.
  • Yes, str_replace() is enough, because it is already binary-safe, as stated in the docs. In other words, it cares neither about the input's encoding nor about global settings.
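
As a rough illustration of both points, here is a minimal sketch that runs the question's utf8_to_mutf8() on a string mixing U+0000 with two-, three- and four-byte characters. The sample string is an assumption chosen for the demonstration. Note that the sketch covers only the NUL replacement asked about here; Java's modified UTF-8 additionally encodes supplementary characters such as U+1F600 as surrogate pairs, which this function leaves untouched.

```php
<?php
// Replace every NUL byte with the two-byte form 0xC0 0x80 (from the question).
function utf8_to_mutf8(string $utf8): string
{
    return str_replace("\x00", "\xC0\x80", $utf8);
}

// Sample input (assumed for this demo):
// "a", U+0000, U+00E9 (C3 A9), U+20AC (E2 82 AC), U+1F600 (F0 9F 98 80)
$input = "a\x00\u{E9}\u{20AC}\u{1F600}";

// Every byte of a multi-byte sequence is >= 0x80, so only U+0000 yields 0x00.
$output = utf8_to_mutf8($input);

echo bin2hex($input), "\n";         // 6100c3a9e282acf09f9880
echo bin2hex($output), "\n";        // 61c080c3a9e282acf09f9880
var_dump(strpos($output, "\x00"));  // bool(false): no NUL byte is left
```

The multi-byte sequences come out byte-for-byte unchanged, which is why no mb_* call or encoding setting is needed.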

If your goal is a sequence of bytes that never contains a "\x00", then this is the way to achieve it.

Personally I think null termination is outdated, and following the old Java way of working around that limitation just comes with the same disadvantage of not being able to use "\x00" in the first place. You just end up having to undo the modification again so that regular UTF-8 handling can deal with the data properly.
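
To make that last point concrete, here is a small sketch of the round trip. The helper mutf8_to_utf8() is a hypothetical name, not something from the question, and preg_match('//u', ...) is used here only as a convenient UTF-8 validity check. Because the byte 0xC0 never appears in valid UTF-8, the pair 0xC0 0x80 can only be an encoded NUL, so the reverse replacement is unambiguous.

```php
<?php
// Hypothetical helper (not from the original post): undo the NUL replacement.
function mutf8_to_utf8(string $mutf8): string
{
    return str_replace("\xC0\x80", "\x00", $mutf8);
}

$original = "key\x00value";
$encoded  = str_replace("\x00", "\xC0\x80", $original);

// 0xC0 0x80 is an overlong form, so strict UTF-8 validation rejects it.
var_dump(preg_match('//u', $original) === 1);      // bool(true)
var_dump(preg_match('//u', $encoded) === 1);       // bool(false)

// Converting back restores the original bytes exactly.
var_dump(mutf8_to_utf8($encoded) === $original);   // bool(true)
```

Any code that expects well-formed UTF-8 would therefore need the data converted back first.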
