Why null-terminated strings? Or: null-terminated vs. characters + length storage

Posted on 2024-08-01 13:40:25

I'm writing a language interpreter in C, and my string type contains a length attribute, like so:

struct String
{
    char* characters;
    size_t length;
};

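Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually, since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:

Bounds-checking is built in if you use "length" instead of looking for a null.

You have to traverse the entire string to find its length.

You have to do extra work to handle a null character in the middle of a null-terminated string.

Null-terminated strings deal poorly with Unicode.

Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.

String slicing (note: strings are immutable in my language). Obviously the second function is slower (and more error-prone: think about adding error-checking of begin and end to both functions).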

struct String slice(struct String in, size_t begin, size_t end)
{
    struct String out;
    out.characters = in.characters + begin;
    out.length = end - begin;

    return out;
}

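#include <stdlib.h> /* malloc */

/* Null-terminated version: has to allocate and copy. The caller frees the result. */
char* slice(char* in, size_t begin, size_t end)
{
    char* out = malloc(end - begin + 1);

    if(out == NULL)
        return NULL;

    for(size_t i = 0; i < end - begin; i++)
        out[i] = in[i + begin];

    out[end - begin] = '\0';

    return out;
}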
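To make the error-checking remark concrete, here is a minimal sketch of the length-based slice with validation of begin and end. The name `slice_checked` and the bool/out-parameter interface are assumptions made for illustration, not part of the code above. The point is that the check needs only two comparisons, whereas a null-terminated version would first have to scan for the terminator to know whether `end` is in range.

```c
#include <stdbool.h>
#include <stddef.h>

struct String
{
    char* characters;
    size_t length;
};

/* Hypothetical helper: writes the slice [begin, end) of `in` to `*out`.
   Returns false, leaving `*out` untouched, if the range is invalid. */
bool slice_checked(struct String in, size_t begin, size_t end, struct String* out)
{
    if (begin > end || end > in.length)
        return false;

    out->characters = in.characters + begin; /* shares storage, no copy */
    out->length = end - begin;
    return true;
}
```

Because the result points into the original buffer, "Hello" sliced out of "Hello, world" occupies no extra storage, which is the sharing described earlier.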

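After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!

So my question is: are there any benefits to null-termination that I'm missing?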
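The point about null characters in the middle of a string can be illustrated with a small sketch. The two function names are hypothetical and exist only to contrast the two views of the same bytes: a null-terminated consumer stops at the first '\0', while the counted struct treats it as ordinary data.

```c
#include <stddef.h>
#include <string.h>

struct String
{
    char* characters;
    size_t length;
};

/* Length as a null-terminated consumer sees it: stops at the first '\0'. */
size_t visible_length_cstr(const char* s)
{
    return strlen(s);
}

/* Length as the struct records it: embedded '\0' bytes are ordinary data. */
size_t visible_length_counted(struct String s)
{
    return s.length;
}
```

For the five bytes 'a', 'b', '\0', 'c', 'd', the first function reports 2 and the second reports whatever length was stored, here 5.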



清醇 2024-08-08 13:40:25

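The usual solution is to do both: keep a length and maintain the null terminator. It's not much extra work, and it means you are always ready to pass the string to any function.

Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.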
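A minimal sketch of the "do both" approach follows. It assumes the struct from the question and a hypothetical constructor named `string_make`; the buffer is allocated one byte larger than the length so that a terminator always follows the characters.

```c
#include <stdlib.h>
#include <string.h>

struct String
{
    char* characters; /* always followed by a '\0' that `length` does not count */
    size_t length;
};

/* Hypothetical constructor: copies `length` bytes and appends a terminator.
   On allocation failure, returns a String whose `characters` is NULL. */
struct String string_make(const char* bytes, size_t length)
{
    struct String out = { NULL, 0 };
    char* buffer = malloc(length + 1);

    if (buffer == NULL)
        return out;

    memcpy(buffer, bytes, length);
    buffer[length] = '\0';

    out.characters = buffer;
    out.length = length;
    return out;
}
```

With this layout, `s.characters` can be handed to functions such as strlen or puts, while the interpreter itself reads `s.length`. The trade-off is that zero-copy slices like the one in the question are no longer terminated unless they run to the end of the buffer.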


ゝ偶尔ゞ 2024-08-08 13:40:25

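One advantage of nul-terminated strings is that if you are walking through a string character by character, you only need to keep a single pointer to address the string:

while (*s)
{
    *s = toupper((unsigned char)*s);
    s++;
}

whereas for strings without sentinels, you need to keep two pieces of state around: either a pointer and an index:

while (i < s.length)
{
    s.data[i] = toupper((unsigned char)s.data[i]);
    i++;
}

...or a current pointer and a limit:

s_end = s + length;
while (s < s_end)
{
    *s = toupper((unsigned char)*s);
    s++;
}

When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.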



洛阳烟雨空心柳 2024-08-08 13:40:25

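Lengths have their problems too.

- The length takes extra storage (not such an issue now, but a big factor 30 years ago).
- Every time you alter a string you have to update the length, so you get reduced performance across the board.
- With a NUL-terminated string you can still keep a length or store a pointer to the last character, so if you are doing lots of string manipulation, you can still match the performance of string-with-length.
- NUL-terminated strings are much simpler. The NUL terminator is just a convention used by functions such as strcat to determine the end of the string, so you can store them in a regular char array rather than having to use a struct.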
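As a sketch of the "store a pointer to the end" idea, the helper below appends to a NUL-terminated buffer without rescanning it from the start each time, which is what repeated strcat calls would do. The name `append_at` and its parameters are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends `src` at `end`, which must point at the
   terminating '\0' of the string. `remaining` is the number of bytes
   available from `end` to the end of the buffer, including the byte
   `end` points at. Returns the new end, or NULL if `src` does not fit. */
char* append_at(char* end, size_t remaining, const char* src)
{
    size_t n = strlen(src);

    if (n + 1 > remaining)
        return NULL;

    memcpy(end, src, n + 1); /* copies the terminator as well */
    return end + n;
}
```

The caller keeps the returned pointer and passes it to the next call, so each append costs time proportional to the appended piece rather than to the whole string.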


ゃ人海孤独症 2024-08-08 13:40:25

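From Joel's "Back to Basics":

"Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant 'ASCII with a Z (zero) at the end.'
Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague."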


一笑百媚生 2024-08-08 13:40:25

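One benefit is that with null-termination, any tail of a null-terminated string is also a null-terminated string. If you need to pass a substring starting at the Nth character (provided there is no buffer overrun) into some string-handling function, that is no problem: just pass the offset address. When the size is stored some other way, you would need to construct a new string.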
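A small sketch of the tail property, under the assumption of a hypothetical helper named `tail_from`. The bounds check is there to cover the "provided there is no buffer overrun" caveat; note that it costs a scan of the string, which is the usual price of null-termination.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the tail of `s` starting at character `n`,
   or NULL if `n` is past the end. The result shares storage with `s`
   and is itself null-terminated. */
const char* tail_from(const char* s, size_t n)
{
    if (n > strlen(s))
        return NULL;

    return s + n;
}
```

The returned pointer can be passed straight to any function that expects a C string, with no copying or allocation.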


菊凝晚露 2024-08-08 13:40:25

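Slightly off-topic, but there's a more efficient way to do length-prefixed strings than the way you describe. Create a struct like this (valid in C99 and up):

struct String
{
  size_t length;
  char characters[]; /* flexible array member; [0] is a GNU extension, not standard C */
};

This creates a struct that has the length at the start, with the 'characters' element usable as a char* just as you would with your current struct. The difference, however, is that you can allocate only a single item on the heap for each string, instead of two. Allocate your strings like this:

mystr = malloc(sizeof(struct String) + strlen(cstring));

That is, the size of the struct (which is just the size_t) plus enough space to put the actual string after it.

If you don't want to use C99, you can also do this with "char characters[1]" and subtract 1 from the length of the string to allocate.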
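The single-allocation idea could be wrapped in a constructor along these lines. This is a sketch under assumptions: `string_new` is a hypothetical name, and the extra byte for a trailing '\0' is an addition for compatibility with C library functions, not something the answer requires.

```c
#include <stdlib.h>
#include <string.h>

struct String
{
    size_t length;
    char characters[]; /* C99 flexible array member */
};

/* Hypothetical constructor: one allocation holds the header and the bytes.
   A trailing '\0' is added for convenience; it is not counted in `length`.
   Returns NULL if allocation fails. */
struct String* string_new(const char* cstring)
{
    size_t length = strlen(cstring);
    struct String* s = malloc(sizeof *s + length + 1);

    if (s == NULL)
        return NULL;

    s->length = length;
    memcpy(s->characters, cstring, length + 1); /* includes the terminator */
    return s;
}
```

A single free(s) releases both the length and the characters. The cost of this layout is that two strings can no longer share one character buffer, so the zero-copy slice from the question does not carry over.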



虐人心 2024-08-08 13:40:25

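Just throwing out some hypotheticals:

- There's no way to get a "wrong" implementation of null-terminated strings. A standardized struct, however, could have vendor-specific implementations.
- No structs are required. Null-terminated strings are "built in", so to speak, by virtue of being a special case of a char*.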


怂人 2024-08-08 13:40:25

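I think the main reason is that the standard says nothing concrete about the size of any type other than char. But sizeof(char) = 1, and that is definitely not enough for a string size.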


紙鸢 2024-08-08 13:40:25

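Although I prefer the array + len method in most cases, there are valid reasons for using null-terminated strings.

Take a 32-bit system.

To store a 7-byte string:
char * (4 bytes) + size_t (4 bytes) + 8 bytes = 16 bytes

To store a 7-byte null-terminated string:
char * (4 bytes) + 8 bytes = 12 bytes

Null-terminated arrays do not need to be immutable like your strings do. I can happily truncate a C string by simply placing a null char. With your representation, you would need to create a new string, which involves allocating memory.

Depending on how the strings are used, your strings will never be able to match the performance possible with C strings.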
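The in-place truncation mentioned above can be sketched as follows; `truncate_cstr` is a hypothetical name chosen for this example, and it assumes the string lives in writable memory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: shortens a mutable C string to at most `new_length`
   characters by writing a terminator. No allocation, no copying. */
void truncate_cstr(char* s, size_t new_length)
{
    if (new_length < strlen(s))
        s[new_length] = '\0';
}
```

For comparison, the `slice` function in the question also shortens a counted string without allocating, by returning a new struct with a smaller length that points at the same characters.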


旧夏天 2024-08-08 13:40:25

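You're absolutely right that 0-termination is a poor method with regard to type checking and to performance for part of the operations. The answers on this page already summarize its origins and uses.

I like the way Delphi stores strings. I believe it maintains a length/maxlength in front of the (variable-length) string. This way the strings can still be null-terminated for compatibility.

My concerns with your mechanism:
- the additional pointer
- immutability in the core parts of your language; normally string types are not immutable, so if you ever reconsider, it will be tough. You'd need to implement a "create copy on change" mechanism
- the use of malloc (hardly efficient, but maybe it is included here just for ease?)

Good luck; writing your own interpreter can be very educational, mainly for understanding the grammar and syntax of programming languages! (At least, it was for me.)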


About the author

梦在深巷
