How do I convert UTF-8 to MUTF-8 in PHP?
In PHP, how can I convert UTF-8 to MUTF-8? I am hoping I can lazily get away with this:
function utf8_to_mutf8(string $utf8): string {
    return str_replace("\x00", "\xC0\x80", $utf8);
}
Given that every byte of a multi-byte character in UTF-8 has the high bit set, \x00 will never occur inside any multi-byte character, so the following should be completely unnecessary?
function utf8_to_mutf8(string $utf8): string {
    $old = mb_internal_encoding();
    mb_internal_encoding("UTF-8");
    $ret = mb_ereg_replace("\x00", "\xC0\x80", $utf8);
    mb_internal_encoding($old);
    return $ret;
}

"\x00" will only occur for the code point U+0000 and never as part of any other code point. Only ASCII characters (U+0000 to U+007F, bits 00000000 to 01111111) are encoded with the highest bit unset. Encountering a byte without the highest bit set can therefore also be used for synchronization when it is unclear where the next code point/character begins.

str_replace() is enough, because it is already binary safe, as stated in the docs. In other words, it cares neither about the input's encoding nor about global settings.

If your goal is a chain of bytes that never contains a "\x00", then this is the way to achieve it. Personally, I think null terminations are outdated, and following the old Java way to work around that limitation just comes with the same disadvantage of not being able to use "\x00" in the first place. You just end up having to undo the modification again so that regular UTF-8 handling can deal with the data properly.
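
To make the byte-level argument concrete, here is a minimal sketch of the round trip. The reverse function mutf8_to_utf8() and the sample string are my own additions for illustration, not something from the question. It assumes the input is valid UTF-8; since the byte C0 never appears in valid UTF-8, the pair C0 80 can only come from the replacement. Note that this only covers the NUL byte. Java's Modified UTF-8 also encodes characters above U+FFFF as surrogate pairs, which this sketch does not attempt.

```php
<?php
// Replace every NUL byte with the two-byte form C0 80.
// str_replace() compares raw bytes, so no mbstring state is involved.
function utf8_to_mutf8(string $utf8): string
{
    return str_replace("\x00", "\xC0\x80", $utf8);
}

// Hypothetical reverse helper: turn C0 80 back into a NUL byte.
// Assumes the data was produced from valid UTF-8 by utf8_to_mutf8().
function mutf8_to_utf8(string $mutf8): string
{
    return str_replace("\xC0\x80", "\x00", $mutf8);
}

// NUL bytes mixed with a 2-byte (U+00E9) and a 3-byte (U+20AC) character.
$input   = "a\x00\u{00E9}\u{20AC}\x00z";
$encoded = utf8_to_mutf8($input);

var_dump(strpos($encoded, "\x00") === false); // no NUL byte is left
var_dump(mutf8_to_utf8($encoded) === $input); // round trip restores the input
```

Both checks should print bool(true): the multi-byte characters pass through untouched, and only the two NUL bytes are rewritten.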