Why does GetByteCount perform a calculation for an IsSingleByte encoding?

Posted on 2024-10-31 16:14:49

I’ve inspected ASCIIEncoding's GetByteCount method. It performs lengthy calculations rather than simply returning String.Length. That makes no sense to me. Do you have any idea why?

Comments (3)

っ〆星空下的拥抱 2024-11-07 16:14:49

EDIT: I've just tried reproducing this, and I can't currently force an ASCIIEncoding instance to use a different replacement fallback. Instead, I'd have to use Encoding.GetEncoding to get a mutable one. So for ASCIIEncoding itself, I agree... but for other implementations where IsSingleByte returns true, you'd still have the potential problem below.


Consider trying to get the byte count of a string which doesn't just contain ASCII characters. The encoding has to take the EncoderFallback into account... which could do any number of things, including increasing the count by an indeterminate amount.

It could be optimized for the case where the encoder fallback is a "default" one which just replaces non-ASCII characters with "?" though.
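The optimization suggested above can be sketched roughly as follows. This is only an illustration: `FastByteCount` is a hypothetical helper, not part of the .NET API, and it assumes that a one-character "?" replacement always encodes to exactly one byte in a single-byte encoding. Any other fallback (a longer replacement string, an exception fallback, a best-fit fallback) is handed to the real GetByteCount.

```csharp
using System;
using System.Text;

public static class ByteCountShortcut
{
    // Hypothetical helper (not a framework method): returns text.Length only
    // when every char, including unencodable ones, is assumed to map to one byte.
    public static int FastByteCount(Encoding encoding, string text)
    {
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));
        if (text == null) throw new ArgumentNullException(nameof(text));

        // Assumption: a single-byte encoding whose fallback replaces each
        // unencodable char with "?" emits exactly one byte per char.
        if (encoding.IsSingleByte
            && encoding.EncoderFallback is EncoderReplacementFallback replacement
            && replacement.DefaultString == "?")
        {
            return text.Length;
        }

        // Any other fallback can change the count, so do the real calculation.
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        Encoding custom = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback("[lol]"),
            new DecoderReplacementFallback("?"));

        Console.WriteLine(FastByteCount(Encoding.ASCII, "hello")); // shortcut path
        Console.WriteLine(FastByteCount(custom, "õäöü"));          // full calculation
    }
}
```

The check is deliberately narrow: it would rather miss the shortcut than return a wrong count for a fallback it does not recognize.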


Further edit: I've just tried to confuse this with a surrogate pair, hoping that it would be represented by a single question mark. Unfortunately not:

string text = "x\ud800\udc00y";
Console.WriteLine(text.Length); // Prints 4
Console.WriteLine(Encoding.ASCII.GetByteCount(text)); // Still prints 4!
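
To see where that count comes from, one option is to encode the string and look at what each byte turns into. The sketch below uses a hypothetical helper, `RoundTripAscii`, which encodes to ASCII and decodes the bytes back so that each output byte shows up as one character.

```csharp
using System;
using System.Text;

public static class SurrogateProbe
{
    // Hypothetical helper: encodes to ASCII, then decodes the bytes back,
    // so the result has one char per byte that the encoder produced.
    public static string RoundTripAscii(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        string text = "x\ud800\udc00y";
        string encoded = RoundTripAscii(text);

        // Shows what the encoder emitted for the surrogate pair.
        Console.WriteLine(encoded);
        // The number of bytes produced always matches GetByteCount.
        Console.WriteLine(encoded.Length == Encoding.ASCII.GetByteCount(text));
    }
}
```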

夏花。依旧 2024-11-07 16:14:49

Interestingly, the mono runtime doesn't seem to include that behaviour:

// Get the number of bytes needed to encode a character buffer.
public override int GetByteCount (char[] chars, int index, int count)
{
    if (chars == null) {
        throw new ArgumentNullException ("chars");
    }
    if (index < 0 || index > chars.Length) {
        throw new ArgumentOutOfRangeException ("index", _("ArgRange_Array"));
    }
    if (count < 0 || count > (chars.Length - index)) {
        throw new ArgumentOutOfRangeException ("count", _("ArgRange_Array"));
    }
    return count;
}

// Convenience wrappers for "GetByteCount".
public override int GetByteCount (String chars)
{
    if (chars == null) {
        throw new ArgumentNullException ("chars");
    }
    return chars.Length;
}

and further down

[CLSCompliantAttribute(false)]
[ComVisible (false)]
public unsafe override int GetByteCount (char *chars, int count)
{
    return count;
}

肤浅与狂妄 2024-11-07 16:14:49

For a multibyte character encoding like UTF-8, this method makes sense, because each character is stored in 1 to 4 bytes. I imagine the method exists for a fixed-size encoding like ASCII for the same reason, even though every ASCII character is defined with 7 bits. In the actual implementation, however, "aaaaaaaa" is 8 bytes, because each ASCII character is stored in 1 byte (8 bits), so the length shortcut only works in the best-case scenario.

Previous versions of .NET Framework allowed spoofing by ignoring the 8th bit. The current version has been changed so that non-ASCII code points fall back during the decoding of bytes.

Source: MSDN

I understand your question as: is there a worst-case scenario for the length shortcut?

        Encoding ae = Encoding.GetEncoding(
              "us-ascii",
              new EncoderReplacementFallback("[lol]"),
              new DecoderReplacementFallback("[you broke Me]"));

        Console.WriteLine(ae.GetByteCount("õäöü"));

This returns 20, because the string "õäöü" contains 4 characters that all fall outside the "us-ascii" character set (U+0000 to U+007F), so the encoder replaces each of them and the encoded text is "[lol][lol][lol][lol]".
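
The arithmetic can be checked by encoding the string and comparing the result with the reported count. This is a minimal sketch: `CreateAsciiWithFallback` is a hypothetical helper name, and the "?" decoder replacement is an arbitrary choice that plays no part in the byte count.

```csharp
using System;
using System.Text;

public static class FallbackCount
{
    // Hypothetical helper: a us-ascii encoding whose encoder substitutes the
    // given replacement string for every character it cannot represent.
    public static Encoding CreateAsciiWithFallback(string replacement)
    {
        return Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback("?"));
    }

    public static void Main()
    {
        Encoding ae = CreateAsciiWithFallback("[lol]");
        string text = "õäöü";

        byte[] bytes = ae.GetBytes(text);

        Console.WriteLine(text.Length);                     // 4 chars in the input
        Console.WriteLine(ae.GetByteCount(text));           // 4 * "[lol]".Length = 20
        Console.WriteLine(Encoding.ASCII.GetString(bytes)); // the replaced text
    }
}
```

With this fallback, returning text.Length would under-report the buffer size by 16 bytes, which is the case the full calculation guards against.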
