Why does GetByteCount of an IsSingleByte Encoding do calculations?
I've inspected ASCIIEncoding's GetByteCount method. It does long calculations rather than returning String.Length. That doesn't make any sense to me. Do you have an idea why?
Comments (3)
EDIT: I've just tried reproducing this, and I can't currently force an ASCIIEncoding instance to have a different replacement. Instead, I'd have to use Encoding.GetEncoding to get a mutable one. So for ASCIIEncoding, I agree... but for other implementations where IsSingleByte returns true, you'd still have the potential problem below.
Consider trying to get the byte count of a string which doesn't just contain ASCII characters. The encoding has to take the EncoderFallback into account... which could do any number of things, including increasing the count by an indeterminate amount.
It could be optimized for the case where the encoder fallback is a "default" one which just replaces non-ASCII characters with "?", though.
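To make the fallback point concrete, here is a minimal sketch (not code from the original answer) that asks two encodings over the same "us-ascii" character set for the byte count of one string. The class name FallbackDemo, the helper TryCount, and the sample string are illustrative assumptions; the encoding APIs are the standard System.Text ones.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Hypothetical helper: returns the byte count, or -1 if the
    // encoder fallback throws while the string is being scanned.
    public static int TryCount(Encoding encoding, string text)
    {
        try
        {
            return encoding.GetByteCount(text);
        }
        catch (EncoderFallbackException)
        {
            return -1;
        }
    }

    public static void Main()
    {
        string text = "caf\u00e9"; // 4 chars; the last one is outside ASCII

        // Default replacement fallback: one "?" per unencodable char.
        Console.WriteLine(TryCount(Encoding.ASCII, text));

        // Same character set, but the fallback throws instead of replacing.
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        Console.WriteLine(TryCount(strict, text));
    }
}
```

Because the outcome depends on which fallback is installed, GetByteCount cannot simply return the string length without looking at the characters.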
Further edit: I've just tried to confuse this with a surrogate pair, hoping that it would be represented by a single question mark. Unfortunately not:
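The snippet from that experiment did not survive on this page. A sketch of what such a test might look like, assuming a string made of 'x', one surrogate pair and 'y' (the class name SurrogateDemo and the helper EncodeAscii are illustrative, not the author's):

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Hypothetical helper: encodes text with the stock "us-ascii" encoding.
    public static byte[] EncodeAscii(string text)
    {
        Encoding ascii = Encoding.GetEncoding("us-ascii");
        return ascii.GetBytes(text);
    }

    public static void Main()
    {
        // 'x', one surrogate pair (U+10000), 'y': four UTF-16 code units.
        string text = "x\ud800\udc00y";
        byte[] bytes = EncodeAscii(text);

        Console.WriteLine("UTF-16 code units: " + text.Length);
        Console.WriteLine("ASCII bytes:       " + bytes.Length);
        Console.WriteLine(Encoding.ASCII.GetString(bytes));
    }
}
```

If the pair were collapsed into a single "?", the byte count would be 3; the answer reports that this is not what happens.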
Interestingly, the Mono runtime doesn't seem to include that behaviour:
and further down
For a multibyte character encoding like UTF-8, this method makes sense, because each character is stored in 1 to 4 bytes. I imagine the method also applies to a fixed-size encoding like ASCII, where every character is defined with 7 bits. In an actual implementation, however, "aaaaaaaa" would be 8 bytes, as characters in ASCII are stored in 1 byte (8 bits), so the length hack would work in the best-case scenario.
I understand your question as: does a worst-case scenario exist for the length hack?
It does. With an encoder fallback that substitutes "[lol]" for every unencodable character, GetByteCount returns 20 for the string "õäöü": it contains 4 characters that all fall outside the "us-ascii" character set limits (U+0000 to U+007F), so after encoding the text is "[lol][lol][lol][lol]", which is 4 × 5 = 20 bytes.
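The code for this case is missing from the page, so the following is only a sketch of how the result could be reproduced. It assumes the encoding was obtained through Encoding.GetEncoding with an EncoderReplacementFallback of "[lol]"; the class name ReplacementDemo, the helper Create, and the "?" decoder replacement are illustrative choices.

```csharp
using System;
using System.Text;

public static class ReplacementDemo
{
    // Hypothetical helper: a "us-ascii" encoding whose encoder fallback
    // substitutes the given string for every unencodable character.
    public static Encoding Create(string replacement)
    {
        return Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback("?"));
    }

    public static void Main()
    {
        // "õäöü" written with escapes so the source file encoding is irrelevant.
        string text = "\u00f5\u00e4\u00f6\u00fc";
        Encoding encoding = Create("[lol]");

        Console.WriteLine(text.Length);                  // 4 characters
        Console.WriteLine(encoding.GetByteCount(text));  // 20, as stated above
        Console.WriteLine(Encoding.ASCII.GetString(encoding.GetBytes(text)));
    }
}
```

Here the string length (4) and the byte count (20) differ even though the encoding reports IsSingleByte as true, which is the worst case for the length shortcut.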