Preventing overlong forms when parsing UTF-8

Posted on 2024-12-04 10:42:44

I have been working on yet another UTF-8 parser as a personal exercise. My implementation works quite well and rejects most malformed sequences (replacing them with U+FFFD), but I can't figure out how to implement rejection of overlong forms. Could anyone tell me how to do so?

Pseudocode:

let w = 0, // the number of continuation bytes pending
    c = 0, // the currently being constructed codepoint
    b,     // the current byte from the source stream
    valid(c) = (
        (c < 0x110000) &&
        ((c & 0xFFFFF800) != 0xD800) &&
        ((c < 0xFDD0) || (c > 0xFDEF)) &&
        ((c & 0xFFFE) != 0xFFFE))
for each b:
    if b < 0x80:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
            w = 0
        append U+b to output string
    else if b < 0xc0:
        if w == 0: // unwanted continuation byte
            append U+FFFD to output string
        else:
            c |= (b & 0x3f) << (--w * 6)
            if w == 0: // done
                if valid(c):
                    append U+c to output string
    else if b < 0xfe:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
        w = (b < 0xe0) ? 1 :
            (b < 0xf0) ? 2 :
            (b < 0xf8) ? 3 :
            (b < 0xfc) ? 4 : 5;
        c = (b & ((1 << (6 - w)) - 1)) << (w * 6); // ugly monstrosity
    else:
        append U+FFFD to output string
if w > 0: // end of stream and we're still waiting for continuation bytes
    append U+FFFD to output string

Comments (3)

夜司空 2024-12-11 10:42:44

If you save the number of bytes you'll need (that is, keep a second copy of the initial value of w), you can compare the UTF-32 value of the code point (I think you are calling it c) with the number of bytes that were used to encode it. You know that:

U+0000 - U+007F 1 byte
U+0080 - U+07FF 2 bytes
U+0800 - U+FFFF 3 bytes
U+10000 - U+1FFFFF 4 bytes
U+200000 - U+3FFFFFF 5 bytes
U+4000000 - U+7FFFFFFF 6 bytes

(and I hope I have done the right math on the left column! Hex math isn't my strong point :-) )
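The table above can be turned into a lookup. This is only a minimal sketch in Python: the names MIN_CODE_POINT and is_overlong are hypothetical, and length is assumed to be the saved initial value of w plus one (the lead byte). Lengths 5 and 6 come from the original UTF-8 design; RFC 3629 limits UTF-8 to 4 bytes and U+10FFFF.

```python
# Smallest code point that legitimately needs each encoded length.
# Lengths 5 and 6 exist only in the original (pre-RFC 3629) UTF-8 design.
MIN_CODE_POINT = {
    1: 0x0,
    2: 0x80,
    3: 0x800,
    4: 0x10000,
    5: 0x200000,
    6: 0x4000000,
}


def is_overlong(c: int, length: int) -> bool:
    """Return True if code point c was encoded in more bytes than it needs.

    `length` is the total number of bytes in the sequence (hypothetical
    parameter: the saved initial w, plus one for the lead byte).
    """
    return c < MIN_CODE_POINT[length]
```

For example, the two-byte sequence C0 AF decodes to 0x2F, which is below 0x80, so is_overlong(0x2F, 2) reports it as overlong.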

Just as a side note: I think there are some logic or formatting errors. In the b < 0x80 branch you only handle w > 0; what happens if w = 0 (for example, if you are decoding A)? And shouldn't you reset c when you find an illegal code point?

瑶笙 2024-12-11 10:42:44

Once you have the decoded character, you can tell how many bytes it should have had if properly encoded just by looking at the highest bit set (counting positions from 1 at the least significant bit).

If the highest set bit's position is <= 7, the UTF-8 encoding requires 1 octet.
If the highest set bit's position is <= 11, the UTF-8 encoding requires 2 octets.
If the highest set bit's position is <= 16, the UTF-8 encoding requires 3 octets.
If the highest set bit's position is <= 21, the UTF-8 encoding requires 4 octets.
etc.

If you save the original w (the sequence length is w + 1) and compare it to these values, you'll be able to tell if the encoding was proper or overlong.
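A rough sketch of the highest-bit idea in Python, assuming the original 6-byte scheme from the question. The function names expected_length and is_overlong_by_bits are made up for illustration; int.bit_length() gives the 1-based position of the highest set bit (0 for the value 0).

```python
def expected_length(c: int) -> int:
    """Bytes needed by the shortest UTF-8 encoding of c (original 6-byte scheme)."""
    bits = c.bit_length()  # 1-based position of the highest set bit; 0 when c == 0
    # Payload capacity in bits of a 1-, 2-, ..., 6-byte sequence.
    for length, max_bits in enumerate((7, 11, 16, 21, 26, 31), start=1):
        if bits <= max_bits:
            return length
    raise ValueError("code point does not fit in 31 bits")


def is_overlong_by_bits(c: int, length: int) -> bool:
    """True if the sequence used more bytes than the shortest form requires."""
    return length > expected_length(c)
```

Here length is again the saved w plus one, so a three-byte sequence that decodes to 0x2F has length 3 against an expected length of 1 and is rejected.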

撞了怀 2024-12-11 10:42:44

I had initially thought that if, at any point after decoding a byte, w > 0 && c == 0, you have an overlong form. However, it's more complicated than that, as Jan pointed out. The simplest answer is probably a table like the one xanatos gives, except that it also rejects anything longer than 4 bytes:

if (c < 0x80 && len > 1) ||
   (c < 0x800 && len > 2) ||
   (c < 0x10000 && len > 3) ||
   len > 4:
    append U+FFFD to output string
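
To show where this check could sit, here is a hedged Python sketch of the question's state machine with the length test added. It is not the asker's actual code: decode, is_valid, is_overlong and the length variable are illustrative names, and unlike the original pseudocode it emits U+FFFD when a completed sequence is invalid (the original appended nothing in that case).

```python
REPLACEMENT = "\uFFFD"


def is_valid(c: int) -> bool:
    # Same filter as valid(c) in the question: in range, not a surrogate,
    # and not a noncharacter.
    return (c < 0x110000
            and (c & 0xFFFFF800) != 0xD800
            and not (0xFDD0 <= c <= 0xFDEF)
            and (c & 0xFFFE) != 0xFFFE)


def is_overlong(c: int, length: int) -> bool:
    # The check from the answer above; also rejects 5- and 6-byte forms.
    return ((c < 0x80 and length > 1)
            or (c < 0x800 and length > 2)
            or (c < 0x10000 and length > 3)
            or length > 4)


def decode(data: bytes) -> str:
    out = []
    w = 0       # continuation bytes still pending
    c = 0       # code point under construction
    length = 0  # total bytes in the current sequence (saved w + 1)
    for b in data:
        if b < 0x80:
            if w > 0:   # sequence cut short by an ASCII byte
                out.append(REPLACEMENT)
                w = 0
            out.append(chr(b))
        elif b < 0xC0:
            if w == 0:  # stray continuation byte
                out.append(REPLACEMENT)
            else:
                w -= 1
                c |= (b & 0x3F) << (w * 6)
                if w == 0:  # sequence complete
                    ok = is_valid(c) and not is_overlong(c, length)
                    out.append(chr(c) if ok else REPLACEMENT)
        elif b < 0xFE:
            if w > 0:   # sequence cut short by a new lead byte
                out.append(REPLACEMENT)
            w = (1 if b < 0xE0 else 2 if b < 0xF0 else
                 3 if b < 0xF8 else 4 if b < 0xFC else 5)
            length = w + 1
            c = (b & ((1 << (6 - w)) - 1)) << (w * 6)
        else:           # 0xFE and 0xFF never appear in UTF-8
            out.append(REPLACEMENT)
    if w > 0:           # input ended in the middle of a sequence
        out.append(REPLACEMENT)
    return "".join(out)
```

With this sketch, decode(b"\xC0\xAF") yields a single U+FFFD instead of "/", while well-formed input passes through unchanged.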