Why null-terminated strings? Or: null-terminated vs. characters + length storage

Posted on 2024-08-01 13:40:25

I'm writing a language interpreter in C, and my string type contains a length attribute, like so:

struct String
{
    char* characters;
    size_t length;
};

Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:

  • Bounds-checking is built in if you use "length" instead of looking for a null.

  • You have to traverse the entire string to find its length.

  • You have to do extra work to handle a null character in the middle of a null-terminated string.

  • Null-terminated strings deal poorly with Unicode.

  • Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.

  • String slicing (note: strings are immutable in my language). Obviously the second version below is slower (and more error-prone: think about adding error-checking of begin and end to both functions).

struct String slice(struct String in, size_t begin, size_t end)
{
    struct String out;
    out.characters = in.characters + begin;
    out.length = end - begin;

    return out;
}

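#include <stdlib.h>

char* slice(char* in, size_t begin, size_t end)
{
    /* The caller owns the returned copy and must free() it. */
    char* out = malloc(end - begin + 1);
    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < end - begin; i++)
        out[i] = in[i + begin];

    out[end - begin] = '\0';

    return out;
}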
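
The error checking of begin and end mentioned above might look like this for the length-based version. This is only a sketch: the name `slice_checked`, the `bool` return convention, and the `const char*` member are assumptions made for illustration, not part of the code above.

```c
#include <stdbool.h>
#include <stddef.h>

struct String
{
    const char* characters; /* const here is an assumption; strings are immutable */
    size_t length;
};

/* Hypothetical helper: writes the slice [begin, end) of `in` to `*out`.
   Returns false and leaves `*out` untouched if the range is invalid. */
bool slice_checked(struct String in, size_t begin, size_t end, struct String* out)
{
    if (begin > end || end > in.length)
        return false;

    /* No copy: the slice shares storage with the original string. */
    out->characters = in.characters + begin;
    out->length = end - begin;
    return true;
}
```

With `s` holding "Hello, world" (length 12), `slice_checked(s, 0, 5, &h)` gives "Hello" pointing at the same characters as `s`, while a range such as 6 to 13 is rejected because it runs past the stored length.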

After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!

So my question is: are there any benefits to null-termination that I'm missing?
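
The point about null characters in the middle of a string can be made concrete with a small sketch. The helper `count_byte` is hypothetical and the struct uses `const char*` as an assumption; the idea is simply that a length-based loop sees every byte, while `strlen` stops at the first '\0'.

```c
#include <stddef.h>
#include <string.h>

struct String
{
    const char* characters;
    size_t length;
};

/* Hypothetical helper: counts occurrences of byte `c`, trusting the
   stored length rather than stopping at the first '\0'. */
size_t count_byte(struct String s, char c)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i++)
        if (s.characters[i] == c)
            n++;
    return n;
}
```

For the five bytes `a b \0 c d`, a `struct String` with length 5 still reaches the trailing `c d`, whereas `strlen` on the same buffer reports 2.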


Comments (10)

清醇 2024-08-08 13:40:25

The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.

Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.
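
A minimal sketch of "doing both", assuming the question's struct layout: the stored length excludes a terminator that is always written one byte past the characters. The constructor name `string_from_bytes` and its failure convention are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

struct String
{
    char* characters; /* always followed by a '\0' that is not counted in length */
    size_t length;
};

/* Hypothetical constructor: copies `length` bytes from `src` and appends a
   terminator. On allocation failure the result has characters == NULL. */
struct String string_from_bytes(const char* src, size_t length)
{
    struct String s = { NULL, 0 };
    char* buf = malloc(length + 1);
    if (buf == NULL)
        return s;

    memcpy(buf, src, length);
    buf[length] = '\0';

    s.characters = buf;
    s.length = length;
    return s;
}
```

The `characters` pointer can then go straight to functions such as `strcmp` or `puts`, while `length` is available without a scan. One cost of this choice is that a slice from the middle of a string can no longer share storage with the original, because it would need its own terminator.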

ゝ偶尔ゞ 2024-08-08 13:40:25

One advantage of nul-terminated strings is that if you are walking through a string character-by-character, you only need to keep a single pointer to address the string:

while (*s)
{
    *s = toupper((unsigned char)*s);
    s++;
}

whereas for strings without sentinels, you need to keep two pieces of state around: either a pointer and an index:

size_t i = 0;
while (i < s.length)
{
    s.data[i] = toupper((unsigned char)s.data[i]);
    i++;
}

...or a current pointer and a limit:

s_end = s + length;
while (s < s_end)
{
    *s = toupper((unsigned char)*s);
    s++;
}

When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.

洛阳烟雨空心柳 2024-08-08 13:40:25

Lengths have their problems too.

  • The length takes extra storage (not such an issue now, but a big factor 30 years ago).

  • Every time you alter a string you have to update the length, so you get reduced performance across the board.

  • With a NUL-terminated string you can still use a length or store a pointer to the last character, so if you are doing lots of string manipulations, you can still equal the performance of string-with-length.

  • NUL-terminated strings are much simpler - the NUL terminator is just a convention used by functions like strcat to determine the end of the string. So you can store them in a regular char array rather than having to use a struct.
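
The third point, keeping a pointer to the end of a NUL-terminated string, could be sketched as follows. `append_at` is a hypothetical helper, not a standard function; it avoids the rescans that repeated `strcat` calls would perform by having the caller carry the terminator position forward.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends `src` at `end`, which must point at the
   current terminator of a string inside a buffer. `limit` is one past the
   end of that buffer. Returns the new terminator position, or NULL if
   `src` does not fit (the buffer is left unchanged in that case). */
char* append_at(char* end, const char* limit, const char* src)
{
    size_t n = strlen(src);
    if ((size_t)(limit - end) <= n)
        return NULL; /* need n bytes plus the terminator */

    memcpy(end, src, n + 1); /* copies the '\0' as well */
    return end + n;
}
```

Starting with `end = buf` on an empty buffer, each call returns the position to pass to the next one, so building a long string costs time proportional to the bytes appended rather than to the length accumulated so far.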

ゃ人海孤独症 2024-08-08 13:40:25

From Joel's Back to Basics:

Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.

一笑百媚生 2024-08-08 13:40:25

One benefit is that with null-termination any tail of a null-terminated string is also a null-terminated string. If you need to pass a substring starting at the Nth character (provided there's no buffer overrun) into some string-handling function - no problem, just pass the offset address there. When the size is stored in some other way, you would need to construct a new string.
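
As a sketch of the tail property, including the "no buffer overrun" proviso: the helper name `tail_from` is hypothetical, and it walks forward to make sure the offset does not step past the terminator before handing back the interior pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the tail of `s` starting at character `n`,
   or NULL if `s` is shorter than `n` characters. No copy is made. */
const char* tail_from(const char* s, size_t n)
{
    /* Walk forward so we never step past the terminator. */
    for (size_t i = 0; i < n; i++)
        if (s[i] == '\0')
            return NULL;
    return s + n;
}
```

For "Hello, world", `tail_from(s, 7)` is simply `s + 7` and can be passed to any function expecting a C string. With the question's pointer-plus-length struct a tail is also cheap (a new struct pointing into the same characters); the copy becomes necessary when the length is stored in front of the characters themselves.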

菊凝晚露 2024-08-08 13:40:25

Slightly off-topic, but there's a more efficient way to do length-prefixed strings than the way you describe. Create a struct like this (valid in C99 and up):

struct String
{
  size_t length;
  char characters[];  /* flexible array member; [0] is a compiler extension */
};

This creates a struct that has the length at the start, with the 'characters' element usable as a char* just as you would with your current struct. The difference, however, is that you need only a single heap allocation for each string, instead of two. Allocate your strings like this:

mystr = malloc(sizeof(struct String) + strlen(cstring));

That is, the size of the struct (which is just the size_t) plus enough space to put the actual string after it.

If you don't want to use C99, you can also do this with "char characters[1]" and subtract 1 from the size to allocate.
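
A fuller sketch of the single-allocation idea, assuming the C99 flexible array member form `char characters[]`. The constructor name `string_new` is made up, and the trailing '\0' it stores is an addition for C-library interop rather than something the layout requires.

```c
#include <stdlib.h>
#include <string.h>

struct String
{
    size_t length;
    char characters[]; /* C99 flexible array member */
};

/* Hypothetical constructor: one allocation holds the header, the
   characters, and a trailing '\0' (an addition for C-library interop).
   Returns NULL on allocation failure; release with free(). */
struct String* string_new(const char* cstring)
{
    size_t length = strlen(cstring);
    struct String* s = malloc(sizeof(struct String) + length + 1);
    if (s == NULL)
        return NULL;

    s->length = length;
    memcpy(s->characters, cstring, length + 1); /* includes the '\0' */
    return s;
}
```

A single `free(s)` releases both the length and the characters. The trade-off against the question's struct is that slices can no longer point into another string's storage, since the length lives directly in front of the characters.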

虐人心 2024-08-08 13:40:25

Just throwing out some hypotheticals:

  • There's no way to get a "wrong" implementation of null-terminated strings. A standardized struct, however, could have vendor-specific implementations.
  • No structs are required. Null-terminated strings are "built-in", so to speak, by virtue of being a special case of a char*.

怂人 2024-08-08 13:40:25

I think the main reason is that the standard says nothing concrete about the size of any type other than char. But sizeof(char) == 1, and that is definitely not enough to hold a string's size.

紙鸢 2024-08-08 13:40:25

Although I prefer the array + len method in most cases, there are valid reasons for using null-terminated.

Take a 32-bit system.

To store a 7-byte string:
char* (4 bytes) + size_t (4 bytes) + 7 bytes = 15 bytes

To store a 7-byte null-terminated string:
char* (4 bytes) + 8 bytes = 12 bytes

Null-terminated arrays don't need to be immutable like your strings do. I can happily truncate a C string by simply placing a null char. With your code, you would need to create a new string, which involves allocating memory.

Depending on how the strings are used, your strings may never be able to match the performance possible with C strings.
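
The in-place truncation described above amounts to a single store. A sketch, with `truncate_in_place` as a hypothetical helper name; the `strlen` check is only there to avoid writing past the end of a shorter string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: shortens `s` in place to at most `n` characters by
   writing a terminator. The buffer must be writable (not a string literal). */
void truncate_in_place(char* s, size_t n)
{
    if (strlen(s) > n)
        s[n] = '\0';
}
```

Applied to a writable buffer holding "Hello, world" with `n` set to 5, the buffer reads "Hello" afterwards and no memory is allocated or freed.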

旧夏天 2024-08-08 13:40:25

You're absolutely right that 0-termination is a poor method with respect to type checking and to performance for some operations. The answers on this page already summarize its origins and uses.

I liked the way Delphi stored strings. I believe it maintains a length/maxlength before the (variable-length) string. This way the strings can still be null-terminated for compatibility.

My concerns with your mechanism:
- additional pointer
- immutability is in the core parts of your language; normally string types are not immutable, so if you ever reconsider then it'll be tough. You'd need to implement a 'create copy on change' mechanism
- use of malloc (hardly efficient, but may be included here just for ease?)

Good luck; writing your own interpreter can be very educational in understanding mainly the grammar and syntax of programming languages! (at least, it was for me)
