Unicode replacement characters in PHP's htmlspecialchars function
In the htmlspecialchars function, setting the ENT_SUBSTITUTE flag is supposed to replace some invalid characters.
Which characters are replaced? And what is the mapping between the invalid characters and the ones used to replace them?
Comments (1)
There is only one, universal replacement character: U+FFFD. If you are writing out UTF-8, then this code point is appropriately encoded. If not, you get the corresponding character reference &#xFFFD; instead.
There is no reversible mapping. By definition, the original byte sequence was invalid, i.e. it does not have a value (valid = has a value).
Bytes (not really "characters") that are replaced are those that are not valid in the assumed source encoding. For example, if the source encoding were UTF-16 and the input contained a lone surrogate, that would be "invalid" (though technically any text processor is supposed to abort fatally in that situation). As a better example, if the source encoding is ASCII, then any byte value above 127 is invalid.
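
To make the behaviour concrete, here is a minimal sketch, not a definitive reference. The input string and variable names are made up for illustration. The flags are passed explicitly because newer PHP versions include ENT_SUBSTITUTE in the default flags. The byte 0xFF is used because it can never occur in well-formed UTF-8.

```php
<?php
// Hypothetical input: "\xFF" is never valid in UTF-8, so the string is malformed.
$input = "a\xFFb <tag>";

// Without ENT_SUBSTITUTE (or ENT_IGNORE), malformed input yields an empty string.
$plain = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE, the invalid byte is replaced by U+FFFD,
// written out as the UTF-8 bytes EF BF BD.
$substituted = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($plain === '');
var_dump($substituted === "a\u{FFFD}b &lt;tag&gt;");
var_dump(bin2hex($substituted));

// For an encoding other than UTF-8, the PHP manual states that the character
// reference &#xFFFD; is written instead. Shift_JIS is used here on the
// assumption that 0xFF is also invalid in that encoding.
$sjis = htmlspecialchars("a\xFFb", ENT_QUOTES | ENT_SUBSTITUTE, 'Shift_JIS');
var_dump($sjis);
```

Because every invalid sequence collapses to the same U+FFFD, the output carries no information about which bytes were originally there, which is why no reverse mapping exists.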