Can UTF-8 encoded data misread as Latin-1 produce ASCII artifacts?
UTF-8 single-byte characters map perfectly to Latin-1 (ISO 8859-1) characters (those with a character code below 128); basically the default ASCII characters.
If I have a UTF-8 encoded string and pass it to a function that expects a Latin-1 string, is there any possibility that the Latin-1 function misinterprets parts of UTF-8 multibyte characters as ASCII characters?
I imagine something like this could happen:
(imaginary) UTF-8 multibyte character: 0xA330
(mis-)interpreted by the Latin-1 function as two Latin-1 characters: 0xA3 0x30
The first of those characters does not lie within the ASCII set, but the second is the ASCII code for the 0 character. Is it possible that a multibyte UTF-8 character produces an artifact that looks like a single-byte UTF-8 / ASCII character, as in the example above?
From my understanding of UTF-8, only single-byte characters contain any bytes with the most significant bit unset, so multibyte characters never contain a byte that a Latin-1 function could misinterpret as a valid ASCII character (because all ASCII characters have the most significant bit unset). But I want to make sure this is true and that I don't get it wrong, because it may have security implications for data sanitization, which is what I am currently doing.
Comments (1)
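You are correct in your understanding: only single-byte characters contain a byte with the most significant bit unset. Every byte of a multibyte UTF-8 sequence, lead byte and continuation bytes alike, has the most significant bit set, so a Latin-1 function can never read one of them as an ASCII character. There is a nice table showing this at: http://en.wikipedia.org/wiki/UTF-8#Description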
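If you would rather check this than take it on trust, here is a minimal Python sketch. The function names `ascii_artifacts` and `check_all_code_points` are made up for this illustration and are not part of any library. It only covers well-formed UTF-8 produced by Python's own encoder, so it says nothing about how a particular Latin-1 function treats malformed input.

```python
import sys


def ascii_artifacts(text: str) -> list:
    """Collect ASCII characters that show up when the UTF-8 bytes of a
    multibyte character are misread as Latin-1 (hypothetical helper)."""
    artifacts = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) == 1:
            continue  # genuine ASCII character, not an artifact
        misread = encoded.decode("latin-1")  # one Latin-1 char per byte
        artifacts.extend(c for c in misread if ord(c) < 0x80)
    return artifacts


def check_all_code_points() -> bool:
    """Return True if no non-ASCII code point encodes to a byte below 0x80."""
    for cp in range(0x80, sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True


if __name__ == "__main__":
    print(ascii_artifacts("£0 €"))   # the "0" and the space are real ASCII
    print(check_all_code_points())
```

The misread text is still mojibake (for example, "£" turns into two Latin-1 characters), but none of the extra characters fall in the ASCII range.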