Preventing overlong forms when parsing UTF-8
I have been working on another UTF-8 parser as a personal exercise. My implementation works quite well and rejects most malformed sequences (replacing them with U+FFFD), but I can't seem to figure out how to implement rejection of overlong forms. Could anyone tell me how to do so?
Pseudocode:
let w = 0, // the number of continuation bytes pending
    c = 0, // the currently being constructed codepoint
    b,     // the current byte from the source stream
    valid(c) = (
        (c < 0x110000) &&
        ((c & 0xFFFFF800) != 0xD800) &&
        ((c < 0xFDD0) || (c > 0xFDEF)) &&
        ((c & 0xFFFE) != 0xFFFE))
for each b:
    if b < 0x80:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
            w = 0
        append U+b to output string
    else if b < 0xc0:
        if w == 0: // unwanted continuation byte
            append U+FFFD to output string
        else:
            c |= (b & 0x3f) << (--w * 6)
            if w == 0: // done
                if valid(c):
                    append U+c to output string
    else if b < 0xfe:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
        w = (b < 0xe0) ? 1 :
            (b < 0xf0) ? 2 :
            (b < 0xf8) ? 3 :
            (b < 0xfc) ? 4 : 5;
        c = (b & ((1 << (6 - w)) - 1)) << (w * 6); // ugly monstrosity
    else:
        append U+FFFD to output string
if w > 0: // end of stream and we're still waiting for continuation bytes
    append U+FFFD to output string
Comments (3)
If you save the number of bytes you'll need (so you save a second copy of the initial value of w), you can compare the UTF-32 value of the codepoint (I think you are calling it c) with the number of bytes that were used to encode it. You know that:
Code point range (UTF-32)    Bytes used
0x00000000 - 0x0000007F      1
0x00000080 - 0x000007FF      2
0x00000800 - 0x0000FFFF      3
0x00010000 - 0x001FFFFF      4
0x00200000 - 0x03FFFFFF      5
0x04000000 - 0x7FFFFFFF      6
(and I hope I have done the right math on the left column! Hex math isn't my strong point :-) )
Just as a sidenote: I think there are some logic errors/formatting errors. In "if b < 0x80 if w > 0", what happens if w = 0 (so for example if you are decoding A)? And shouldn't you reset c when you find an illegal codepoint?
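The range comparison described in this answer can be sketched in a few lines of Python. This is only an illustration: the names `MIN_CODE_POINT`, `is_overlong` and `w0` (the saved initial value of w) are made up for the example, and the 5- and 6-byte rows are kept only because the question's pseudocode still accepts those lead bytes.

```python
# Smallest code point that legitimately needs each encoded length (in bytes).
# Anything below this minimum could have been encoded in fewer bytes.
MIN_CODE_POINT = {
    1: 0x00,
    2: 0x80,
    3: 0x800,
    4: 0x10000,
    5: 0x200000,   # 5- and 6-byte forms are not part of modern UTF-8
    6: 0x4000000,
}

def is_overlong(c: int, w0: int) -> bool:
    """c is the decoded code point, w0 the saved initial value of w
    (the number of continuation bytes, so the sequence had w0 + 1 bytes)."""
    return c < MIN_CODE_POINT[w0 + 1]
```

For example, the bytes C0 AF decode to 0x2F with one continuation byte, so `is_overlong(0x2F, 1)` is true, while C3 A9 decodes to 0xE9 and passes the check.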

Once you have the decoded character, you can tell how many bytes it should have had if properly encoded just by looking at the highest bit set (counting the least significant bit as position 1).
If the highest set bit's position is <= 7, the UTF-8 encoding requires 1 octet.
If the highest set bit's position is <= 11, the UTF-8 encoding requires 2 octets.
If the highest set bit's position is <= 16, the UTF-8 encoding requires 3 octets.
etc.
If you save the original w and compare it (plus one, for the lead byte) to these values, you'll be able to tell if the encoding was proper or overlong.
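A rough Python sketch of the highest-set-bit idea follows. The function names and `w0` (the saved original w) are hypothetical, and the thresholds past 16 bits (21 and 26) are my continuation of the "etc." based on the UTF-8 bit layout, not something spelled out in the answer.

```python
def expected_octets(c: int) -> int:
    """Shortest UTF-8 length for code point c, from its highest set bit."""
    bits = c.bit_length()  # position of the highest set bit, counting from 1 (0 for c == 0)
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    if bits <= 21:
        return 4
    if bits <= 26:
        return 5
    return 6

def is_overlong_by_bits(c: int, w0: int) -> bool:
    """True if the sequence used more bytes (w0 continuation bytes plus
    the lead byte) than the shortest encoding of c needs."""
    return w0 + 1 > expected_octets(c)
```

With this, E0 80 AF (which decodes to 0x2F using two continuation bytes) is flagged because `expected_octets(0x2F)` is 1 but three bytes were used.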

I had initially thought that if at any point in time after decoding a byte, w > 0 && c == 0, you have an overlong form. However, it's more complicated than that, as Jan pointed out. The simplest answer is probably to have a table like the one xanatos has, except that it rejects anything longer than 4 bytes.
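Putting the pieces together, here is one possible Python rendering of the question's decoder with a minimum-value table and no support for sequences longer than 4 bytes. Treat it as a sketch under assumptions rather than the code the answer originally showed: `decode`, `MIN_FOR_PENDING` and `w0` are invented names, and emitting U+FFFD for overlong or otherwise invalid code points is my choice (the question's pseudocode silently drops code points that fail valid(c)).

```python
REPLACEMENT = "\ufffd"

# Smallest code point allowed for a sequence with 1, 2 or 3 continuation bytes.
MIN_FOR_PENDING = {1: 0x80, 2: 0x800, 3: 0x10000}

def valid(c: int) -> bool:
    # Same checks as valid(c) in the question's pseudocode.
    return (c < 0x110000
            and (c & 0xFFFFF800) != 0xD800
            and not (0xFDD0 <= c <= 0xFDEF)
            and (c & 0xFFFE) != 0xFFFE)

def decode(data: bytes) -> str:
    out = []
    w = 0    # continuation bytes still pending
    w0 = 0   # saved initial value of w for the current sequence
    c = 0    # code point under construction
    for b in data:
        if 0x80 <= b < 0xC0 and w > 0:
            # Expected continuation byte: add its 6 payload bits.
            w -= 1
            c |= (b & 0x3F) << (w * 6)
            if w == 0:
                ok = c >= MIN_FOR_PENDING[w0] and valid(c)
                out.append(chr(c) if ok else REPLACEMENT)
            continue
        if w > 0:
            out.append(REPLACEMENT)  # sequence was cut short
            w = 0
        if b < 0x80:
            out.append(chr(b))
        elif b < 0xC0:
            out.append(REPLACEMENT)  # stray continuation byte
        elif b < 0xF8:
            w = w0 = 1 if b < 0xE0 else 2 if b < 0xF0 else 3
            c = (b & ((1 << (6 - w)) - 1)) << (w * 6)
        else:
            out.append(REPLACEMENT)  # 0xF8..0xFF: 5/6-byte leads or invalid bytes
    if w > 0:
        out.append(REPLACEMENT)      # stream ended mid-sequence
    return "".join(out)
```

Because c is overwritten every time a new lead byte arrives, there is no separate reset step after an illegal code point. Rejecting lead bytes from 0xF8 upward also means the table never needs 5- or 6-byte rows.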