
What is the rationale for null-terminated strings?

Posted 2024-10-07 07:59:09 · 1454 characters · 11 views · 0 comments

As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:

Length-prefixed (e.g. Pascal) strings existed before C.
Length prefixed strings make several algorithms faster by allowing constant time length lookup.
Length prefixed strings make it more difficult to cause buffer overrun errors.
Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
Pretty much every other language (e.g. Perl, Pascal, Python, Java, C#, etc.) uses length-prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.
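To make the layout difference in the list above concrete, here is a minimal sketch in C. The `lp_string` type, the `LP_CAPACITY` limit, and the `lp_from_bytes` and `lp_length` helpers are hypothetical names invented for this illustration, not part of any standard library. A fixed-capacity buffer is assumed so the example does not need heap allocation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-prefixed string: the count is stored next to the bytes. */
#define LP_CAPACITY 64

typedef struct {
    size_t len;                 /* number of bytes in use */
    char   data[LP_CAPACITY];   /* payload; may legitimately contain '\0' */
} lp_string;

/* Copy n raw bytes into out. Returns 0 on success, -1 if n exceeds capacity. */
static int lp_from_bytes(lp_string *out, const char *bytes, size_t n)
{
    if (n > LP_CAPACITY) {
        return -1;
    }
    memcpy(out->data, bytes, n);
    out->len = n;
    return 0;
}

/* O(1): the length is read back, no scan over the payload is needed. */
static size_t lp_length(const lp_string *s)
{
    return s->len;
}
```

Given the five bytes `'a', 'b', '\0', 'c', 'd'`, `lp_length` reports 5 after `lp_from_bytes`, whereas `strlen` on the same bytes stops at the embedded null and reports 2.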

Several of these things have come to light more recently than C, so it would make sense for C to not have known of them. However, several were plain well before C came to be. Why would null terminated strings have been chosen instead of the obviously superior length prefixing?

EDIT: Since some asked for facts (and didn't like the ones I already provided) on my efficiency point above, they stem from a few things:

Concat using null-terminated strings requires O(n + m) time, where n is the length of the destination and m the length of the string being appended. Length prefixing often requires only O(m).
Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).
Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.
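The complexity claims above can be sketched in C as follows. This is only an illustration under assumptions: `counted_str`, `COUNTED_CAP`, `nt_append`, and `counted_append` are made-up names, and both versions append into a buffer that already has room, so no reallocation cost is modeled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string with a fixed capacity (illustration only). */
#define COUNTED_CAP 128

typedef struct {
    size_t len;
    char   data[COUNTED_CAP];
} counted_str;

/* Null-terminated append: strlen must walk all n bytes of dst before the
 * m bytes of src can be copied, hence O(n + m). Returns 0 on success. */
static int nt_append(char *dst, size_t dst_cap, const char *src)
{
    size_t n = strlen(dst);          /* O(n) scan for the terminator */
    size_t m = strlen(src);          /* O(m) scan */
    if (n + m + 1 > dst_cap) {
        return -1;                   /* would overflow the buffer */
    }
    memcpy(dst + n, src, m + 1);     /* copy src including its '\0' */
    return 0;
}

/* Counted append: the write position is dst->len, so only the m bytes of
 * src are touched, hence O(m). Returns 0 on success. */
static int counted_append(counted_str *dst, const counted_str *src)
{
    if (src->len > COUNTED_CAP - dst->len) {
        return -1;                   /* not enough room left */
    }
    memcpy(dst->data + dst->len, src->data, src->len);
    dst->len += src->len;
    return 0;
}
```

The difference matters most when many small pieces are appended to one long destination, because `nt_append` rescans the destination on every call.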

From answers below, these are some cases where null terminated strings are more efficient:

When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only if you haven't dynamically allocated the string (because then you'd have to free it, which means using the CPU register you saved to hold the pointer you originally got from malloc and friends).
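The first case above, cutting off the start of a string, might look like this in C. The sketch assumes the length prefix is stored inline, directly in front of the bytes; `prefixed_str`, `PREFIXED_CAP`, `nt_drop_prefix`, and `prefixed_drop_prefix` are hypothetical names used only here.

```c
#include <stddef.h>
#include <string.h>

/* Null-terminated: skipping the first k bytes is pointer arithmetic, O(1).
 * The caller must ensure k <= strlen(s). */
static const char *nt_drop_prefix(const char *s, size_t k)
{
    return s + k;
}

/* Hypothetical inline length prefix: the count sits directly before the bytes. */
#define PREFIXED_CAP 64

typedef struct {
    size_t len;
    char   data[PREFIXED_CAP];
} prefixed_str;

/* With the prefix stored inline, the remaining bytes are shifted down so the
 * count still sits directly in front of them: O(len - k).
 * Returns 0 on success, -1 if k exceeds the current length. */
static int prefixed_drop_prefix(prefixed_str *s, size_t k)
{
    if (k > s->len) {
        return -1;
    }
    memmove(s->data, s->data + k, s->len - k);
    s->len -= k;
    return 0;
}
```

Carrying a separate pointer and length pair would make this O(1) again, but then the length is no longer a prefix stored with the data, which is a different design from the one discussed in the question.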

None of the above are nearly as common as length and concat.

There's one more asserted in the answers below:

You need to cut off the end of the string

but this one is incorrect -- it's the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be, length prefixers just subtract from the prefix.)
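A minimal C sketch of the two truncation strategies described in the parenthesis above. The `lp_text` type and both function names are assumptions made for illustration; in both versions the new length is taken to be already known.

```c
#include <stddef.h>

/* Null-terminated: write a terminator at the new end. O(1), provided the
 * caller already knows that new_len does not exceed the current length. */
static void nt_truncate(char *s, size_t new_len)
{
    s[new_len] = '\0';
}

/* Hypothetical length-prefixed string (illustration only). */
typedef struct {
    size_t len;
    char   data[64];
} lp_text;

/* Length-prefixed: shrink the stored count. O(1), and the bound can be
 * checked without scanning. Returns 0 on success, -1 if new_len is too big. */
static int lp_truncate(lp_text *s, size_t new_len)
{
    if (new_len > s->len) {
        return -1;
    }
    s->len = new_len;
    return 0;
}
```

Both operations touch a single word of memory, which matches the claim that neither representation has an advantage here.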

梦里兽 2024-10-14 07:59:09

From the horse's mouth:

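None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

Dennis M. Ritchie, The Development of the C Language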

in light of the raging squall below:
