What is the internal format of a .NET string?

Posted on 2024-07-25 04:50:39

I'm making some pretty string-manipulation-intensive code in C#.NET and got curious about some Joel Spolsky articles I remembered reading a while back:

http://www.joelonsoftware.com/articles/fog0000000319.html
http://www.joelonsoftware.com/articles/Unicode.html

So, how does .NET do it? Two bytes per char? There ARE some Unicode chars^H^H^H^H^H code points that need more than that. And how is the length encoded?

Comments (3)

素罗衫 2024-08-01 04:50:39

Before Jon Skeet turns up, here is a link to his excellent blog on strings in C#.

In the current implementation at least, strings take up 20+(n/2)*4 bytes (rounding the value of n/2 down), where n is the number of characters in the string. The string type is unusual in that the size of the object itself varies.
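The arithmetic in that formula can be sketched as follows. This is only an illustration of the quoted formula, not a measurement: the class and method names are made up for this example, and the real object size depends on the runtime version and on whether the process is 32-bit or 64-bit.

```csharp
using System;

static class StringSizeEstimate
{
    // Applies the quoted formula: 20 + (n/2)*4 bytes, with n/2 rounded down.
    // Integer division in C# truncates, which rounds down for non-negative n.
    // The formula itself is taken from the answer above and is an assumption
    // about one particular implementation.
    public static int EstimatedBytes(int charCount)
    {
        if (charCount < 0)
            throw new ArgumentOutOfRangeException(nameof(charCount));
        return 20 + (charCount / 2) * 4;
    }

    static void Main()
    {
        foreach (string s in new[] { "", "a", "ab", "hello" })
            Console.WriteLine($"{s.Length} chars: about {EstimatedBytes(s.Length)} bytes");
    }
}
```

Because n/2 is rounded down, strings of length 0 and 1 get the same estimate, as do lengths 2 and 3, and so on.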

一张白纸 2024-08-01 04:50:39

.NET uses UTF-16.

From System.String on MSDN:

"Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object."
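To make the "numeric value of each element" part concrete, here is a minimal sketch that reads the UTF-16 code units of a string as numbers. The class and method names are invented for this illustration.

```csharp
using System;

static class CodeUnits
{
    // Returns the numeric value of every Char (UTF-16 code unit) in the string.
    public static int[] ToCodeUnits(string s)
    {
        var units = new int[s.Length];
        for (int i = 0; i < s.Length; i++)
            units[i] = s[i]; // implicit char -> int conversion
        return units;
    }

    static void Main()
    {
        // 'A' is U+0041, and \u03a0 is the Greek capital letter Pi.
        foreach (int unit in ToCodeUnits("A\u03a0"))
            Console.WriteLine($"0x{unit:X4}");
    }
}
```

Both characters here fit in a single 16-bit code unit, so the array has the same number of entries as there are code points.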

辞别 2024-08-01 04:50:39

The String object is too complicated for a short example that encodes a given text into a string and shows the resulting memory content as a sequence of byte values.

A String object represents text as a sequence of UTF-16 code units. It is a sequential collection of System.Char objects, each of which corresponds to a UTF-16 code unit.
A single Char object usually represents a single code point. A code point might require more than one encoded element, i.e. more than one Char object: supplementary code points are encoded as surrogate pairs, and a grapheme can span several Char objects. Note: UTF-16 is a variable-width encoding.
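The difference between Char count and code point count can be sketched like this. The helper name CountCodePoints is hypothetical; char.IsSurrogatePair and char.ConvertToUtf32 are the framework methods it relies on. The sketch counts an unpaired surrogate as one code point, which is a simplification.

```csharp
using System;

static class CodePoints
{
    // Counts code points by treating each valid surrogate pair as one unit.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A high surrogate followed by a low surrogate encodes one code point,
            // so skip the second half of the pair.
            if (char.IsSurrogatePair(s, i))
                i++;
            count++;
        }
        return count;
    }

    static void Main()
    {
        // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual Plane.
        string s = "a\U0001D11E";
        Console.WriteLine(s.Length);                        // number of Char objects
        Console.WriteLine(CountCodePoints(s));              // number of code points
        Console.WriteLine($"0x{(int)s[1]:X4} 0x{(int)s[2]:X4}"); // the two surrogates
        Console.WriteLine($"U+{char.ConvertToUtf32(s, 1):X}");   // the decoded code point
    }
}
```

For grapheme-level iteration (for example, a base letter plus combining marks), System.Globalization.StringInfo is the class to look at; the sketch above only handles surrogate pairs.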

The length of the string is stored in memory as a property of the String object. Note: a String object can include embedded null characters, which count as part of the string's length (as opposed to C and C++, where a null character indicates the end of a string, so the length does not have to be stored separately). The internal character array that stores the Char objects can actually be longer than the length of the string (a result of the allocation strategy).
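A small sketch of the embedded-null behaviour described above. The helper CStyleLength is a made-up name that mimics what a C strlen would report; it is there only for contrast.

```csharp
using System;

static class EmbeddedNulls
{
    // Mimics C's strlen: counts characters up to the first null, if there is one.
    public static int CStyleLength(string s)
    {
        int firstNull = s.IndexOf('\0');
        return firstNull >= 0 ? firstNull : s.Length;
    }

    static void Main()
    {
        string s = "ab\0cd";
        Console.WriteLine(s.Length);        // the stored length includes the null
        Console.WriteLine(CStyleLength(s)); // a null-terminated view stops early
    }
}
```

Because the length is stored with the object, reading s.Length does not require scanning the characters.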

If you struggle to create the right encoding to work with (because you cannot find any property called System.Text.Encoding.UTF16), UTF-16 is actually System.Text.Encoding.Unicode, as used in this example:

string unicodeString = "pi stands for \u03a0";
byte[] encoded = System.Text.Encoding.Unicode.GetBytes(unicodeString);

The Encoding.Unicode property returns a UnicodeEncoding object that uses the little endian byte order. The UnicodeEncoding class (which implements the UTF-16 encoding) can handle big endian as well, and it also supports a byte order mark. The native byte order of the Intel platform is little endian, so it is probably more efficient for .NET (and Windows) to store Unicode strings in this format.
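The byte order difference is easy to see by encoding the same character both ways. This is a minimal sketch; the class name and the Hex helper are invented for the illustration, while Encoding.Unicode, Encoding.BigEndianUnicode, GetBytes and GetPreamble are the framework members being shown.

```csharp
using System;
using System.Text;

static class ByteOrderDemo
{
    // Formats bytes as hyphen-separated hex, e.g. "A0-03".
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes);

    static void Main()
    {
        string pi = "\u03a0"; // Greek capital letter Pi, U+03A0

        // Little endian: the low byte of each code unit comes first.
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(pi)));
        // Big endian: the high byte comes first.
        Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetBytes(pi)));

        // GetBytes does not emit a byte order mark; GetPreamble returns it separately.
        Console.WriteLine(Hex(Encoding.Unicode.GetPreamble()));
        Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetPreamble()));
    }
}
```

If a byte order mark is needed in the output, it has to be written explicitly or produced by a writer configured with the encoding.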
