空终止字符串的基本原理是什么?

发布于 2024-10-07 07:59:09 字数 1454 浏览 11 评论 0原文

尽管我很喜欢 C 和 C++,但我还是忍不住对空终止字符串的选择感到摸不着头脑:

  • 长度前缀(即 Pascal)字符串在 C 之前就已经存在
  • 。长度前缀字符串通过允许恒定时间长度查找,使多种算法更快。
  • 带长度前缀的字符串更难以导致缓冲区溢出错误。
  • 即使在 32 位计算机上,如果允许字符串为可用内存大小,则长度前缀字符串仅比空终止字符串宽三个字节。在 16 位机器上,这是一个字节。在 64 位机器上,4GB 是合理的字符串长度限制,但即使您想将其扩展到机器字的大小,64 位机器通常也有足够的内存,使得额外的 7 个字节成为空参数。我知道最初的 C 标准是为极其糟糕的机器(就内存而言)编写的,但效率论点并不能让我信服。
  • 几乎所有其他语言(即 Perl、Pascal、Python、Java、C# 等)都使用长度前缀字符串。这些语言通常在字符串操作基准测试中击败 C,因为它们处理字符串的效率更高。
  • C++ 使用 std::basic_string 模板对此进行了一些纠正,但期望以 null 结尾的字符串的纯字符数组仍然普遍存在。这也是不完美的,因为它需要堆分配。
  • 以空结尾的字符串必须保留一个字符(即空),该字符不能存在于字符串中,而长度前缀的字符串可以包含嵌入的空值。

其中一些事情比 C 更晚才被发现,因此 C 不知道它们也是有道理的。然而,有几个早在 C 出现之前就已经很简单了。为什么选择以空结尾的字符串而不是明显优越的长度前缀?

编辑:由于有些人要求提供关于我上面的效率点的事实(并且不喜欢我已经提供的事实),它们源于以下几件事:

  • 使用 null 终止的 Concat字符串需要 O(n + m) 时间复杂度。长度前缀通常只需要 O(m)。
  • 使用空终止字符串的长度需要 O(n) 时间复杂度。长度前缀是 O(1)。
  • 长度和连接是迄今为止最常见的字符串操作。在某些情况下,以 null 结尾的字符串可以更有效,但这种情况发生的频率要低得多。

从下面的答案来看,在某些情况下,以 null 结尾的字符串效率更高:

  • 当您需要切断字符串的开头并需要将其传递给某些方法时。即使允许您销毁原始字符串,您也不能真正在恒定时间内使用长度前缀来完成此操作,因为长度前缀可能需要遵循对齐规则。
  • 在某些情况下,如果您只是逐个字符地循环字符串,您也许可以保存 CPU 寄存器。请注意,这只适用于您没有动态分配字符串的情况(因为这样您就必须释放它,从而需要使用您保存的 CPU 寄存器来保存您最初从 malloc 和朋友那里获得的指针)。

以上都不像长度和连接那么常见。

下面的答案中还有一个断言:

  • 您需要切断字符串的末尾

,但这一个是不正确的——对于以 null 结尾的字符串和长度前缀的字符串来说,时间是相同的。 (空终止字符串只需在您想要新结尾的位置粘贴一个空值,长度前缀只需从前缀中减去。)

As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:

  • Length prefixed (i.e. Pascal) strings existed before C
  • Length prefixed strings make several algorithms faster by allowing constant time length lookup.
  • Length prefixed strings make it more difficult to cause buffer overrun errors.
  • Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
  • Pretty much every other language (i.e. Perl, Pascal, Python, Java, C#, etc) use length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
  • C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
  • Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

Several of these things have come to light more recently than C, so it would make sense for C to not have known of them. However, several were plain well before C came to be. Why would null terminated strings have been chosen instead of the obviously superior length prefixing?

EDIT: Since some asked for facts (and didn't like the ones I already provided) on my efficiency point above, they stem from a few things:

  • Concat using null terminated strings requires O(n + m) time complexity. Length prefixing often require only O(m).
  • Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).
  • Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.

From answers below, these are some cases where null terminated strings are more efficient:

  • When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
  • In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (Because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).

None of the above are nearly as common as length and concat.

There's one more asserted in the answers below:

  • You need to cut off the end of the string

but this one is incorrect -- it's the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be, length prefixers just subtract from the prefix.)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(20

梦里兽 2024-10-14 07:59:09

来自马口

BCPL、B 或 C 都不支持
字符数据强在
语言;每个都对字符串处理得很多
就像整数向量和
通过一些补充一般规则
惯例。在 BCPL 和 B a 中
字符串文字表示的地址
一个静态区域初始化为
字符串的字符,打包成
细胞。在 BCPL 中,第一个打包字节
包含的字符数
字符串; B 中没有计数
字符串以 a 结尾
特殊字符,B 拼写的
<代码>*e。此更改已部分完成
以避免长度限制
由持有引起的字符串
计数 8 位或 9 位槽,以及
部分原因是维持计数
根据我们的经验,似乎较少
比使用终止符方便。

Dennis M Ritchie,C 语言的发展

From the horse's mouth

None of BCPL, B, or C supports
character data strongly in the
language; each treats strings much
like vectors of integers and
supplements general rules by a few
conventions. In both BCPL and B a
string literal denotes the address of
a static area initialized with the
characters of the string, packed into
cells. In BCPL, the first packed byte
contains the number of characters in
the string; in B, there is no count
and strings are terminated by a
special character, which B spelled
*e. This change was made partially
to avoid the limitation on the length
of a string caused by holding the
count in an 8- or 9-bit slot, and
partly because maintaining the count
seemed, in our experience, less
convenient than using a terminator.

Dennis M Ritchie, Development of the C Language

野却迷人 2024-10-14 07:59:09

C 没有字符串作为语言的一部分。 C 中的“字符串”只是一个指向 char 的指针。所以也许你问错了问题。

“省略字符串类型的理由是什么”可能更相关。对此我要指出的是,C 不是面向对象的语言,只有基本值类型。字符串是一个更高级别的概念,必须通过以某种方式组合其他类型的值来实现。 C 处于较低的抽象级别。

鉴于下面的狂风暴雨:

我只是想指出,我并不是想说这是一个愚蠢或糟糕的问题,或者表示字符串的 C 方式是最好的选择。我试图澄清,如果考虑到 C 没有区分字符串作为数据类型和字节数组的机制,那么这个问题会更简洁。考虑到当今计算机的处理和内存能力,这是最佳选择吗?可能不会。但事后诸葛亮总是 20/20 之类的:)

C doesn't have a string as part of the language. A 'string' in C is just a pointer to char. So maybe you're asking the wrong question.

"What's the rationale for leaving out a string type" might be more relevant. To that I would point out that C is not an object oriented language and only has basic value types. A string is a higher level concept that has to be implemented by in some way combining values of other types. C is at a lower level of abstraction.

in light of the raging squall below:

I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. I'm trying to clarify that the question would be more succinctly put if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of todays computers? Probably not. But hindsight is always 20/20 and all that :)

未央 2024-10-14 07:59:09

这个问题是作为长度前缀字符串(LPS)零终止字符串(SZ)问题提出的,但主要暴露了长度前缀字符串的好处。这似乎让人不知所措,但说实话,我们还应该考虑 LPS 的缺点和 SZ 的优点。

据我了解,这个问题甚至可能被理解为一种带有偏见的方式来询问“零终止字符串的优点是什么?”。

零终止字符串的优点(我明白了):

  • 非常简单,不需要引入语言中的新概念,char
    数组/字符指针可以做到。
  • 核心语言只包含最少的语法糖来转换
    双引号到 a 之间的内容
    一堆字符(实际上是一堆
    字节)。在某些情况下可以使用
    完全初始化事物
    与文字无关。例如xpm
    图像文件格式是有效的 C 源代码
    包含编码为的图像数据
    细绳。
  • 顺便说一句,您可以在字符串文字中放置一个零,编译器会
    只需在文字末尾添加另一个:“this\0is\0valid\0C”
    它是一个字符串吗?还是四弦?或者一堆字节......
  • 平面实现,没有隐藏的间接寻址,没有隐藏的整数。
  • 不涉及隐藏的内存分配(好吧,一些臭名昭著的非
    标准函数,如 strdup
    执行分配,但这主要是
    问题的根源)。
  • 对于小型或大型硬件没有具体问题(想象一下负担
    管理 8 上的 32 位前缀长度
    位微控制器,或
    限制字符串大小的限制
    小于 256 字节,这是我很久以前使用 Turbo Pascal 时实际遇到的问题)。
  • 字符串操作的实现只是少数
    非常简单的库函数,
  • 对于字符串的主要使用有效:常量文本读取
    从已知的开始顺序
    (主要是给用户的消息)。
  • 终止零甚至不是强制性的,所有必要的工具
    像一堆一样操纵字符
    字节可用。表演时
    C 中的数组初始化,您可以
    甚至避免 NUL 终止符。只是
    设置正确的尺寸。 字符a[3] =
    "foo";
    是有效的 C(不是 C++)并且
    不会在 a 中添加最后一个零。
  • 与unix观点“一切皆文件”一致,包括
    没有内在长度的“文件”
    如标准输入、标准输出。您应该记住开放读取和写入原语是实现的
    处于非常低的水平。它们不是库调用,而是系统调用。并且使用相同的API
    对于二进制或文本文件。文件读取原语获取缓冲区地址和大小并返回
    新尺寸。并且可以使用字符串作为缓冲区来写入。使用另一种字符串
    表示意味着您不能轻松地使用文字字符串作为输出缓冲区,或者
    当将其转换为 char* 时,您必须使其具有非常奇怪的行为。即
    不返回字符串的地址,而是返回实际数据。
  • 非常容易操作从文件中读取的文本数据,无需复制无用的缓冲区,
    只需在正确的位置插入零(好吧,对于现代 C 来说并非如此,因为双引号字符串是 const char 数组,现在通常保存在不可修改的数据段中)。
  • 在前面添加一些无论大小的 int 值都意味着对齐问题。最初的
    长度应该对齐,但是没有理由对字符数据(和
    再次强调,当将字符串视为一堆时,强制对齐字符串会带来问题。
    字节)。
  • 常量文字字符串 (sizeof) 的长度在编译时已知。那么为什么会
    有人想将其存储在内存中并将其添加到实际数据之前吗?
  • 在某种程度上,C 的做法(几乎)与其他人一样,字符串被视为 char 数组。由于 C 不管理数组长度,因此字符串也不管理逻辑长度。唯一令人惊讶的是最后添加了 0 项,但这只是在双引号之间键入字符串时处于核心语言级别。用户可以完美地调用传递长度的字符串操作函数,甚至可以使用普通的 memcopy 来代替。 SZ只是一个设施。在大多数其他语言中,数组长度是受管理的,逻辑上这与字符串相同。
  • 在现代,无论如何 1 字节字符集是不够的,您经常必须处理编码的 unicode 字符串,其中字符数与字节数非常不同。这意味着用户可能需要的不仅仅是“尺寸”,还需要其他信息。对于这些其他有用的信息,保持长度没有任何用处(特别是没有自然的地方来存储它们)。

也就是说,在标准 C 字符串确实效率低下的罕见情况下,无需抱怨。库可用。如果我遵循这种趋势,我应该抱怨标准 C 不包含任何正则表达式支持函数...但实际上每个人都知道这不是一个真正的问题,因为有可用于此目的的库。因此,当需要字符串操作效率时,为什么不使用像 bstring 这样的库呢?或者甚至是 C++ 字符串?

编辑:我最近查看了D strings。有趣的是,所选择的解决方案既不是大小前缀,也不是零终止。与 C 中一样,用双引号括起来的文字字符串只是不可变 char 数组的简写,并且该语言也有一个 string 关键字,表示该含义(不可变 char 数组)。

但D数组比C数组丰富得多。对于静态数组,长度在运行时已知,因此无需存储长度。编译器在编译时就有它。对于动态数组,长度是可用的,但 D 文档没有说明它的保存位置。据我们所知,编译器可以选择将其保存在某个寄存器中,或者保存在远离字符数据的某个变量中。

在普通的 char 数组或非文字字符串上,没有最后的零,因此如果程序员想要从 D 调用某些 C 函数,则必须将其自己放置。在文字字符串的特殊情况下,但是 D 编译器仍然在结尾处放置一个零。每个字符串的末尾(以允许轻松转换为 C 字符串,以便更容易调用 C 函数?),但这个零不是字符串的一部分(D 不将其计入字符串大小)。

唯一让我有些失望的是字符串应该是 utf-8,但即使使用多字节字符,长度显然仍然返回多个字节(至少在我的编译器 gdc 上是这样)。我不清楚这是编译器错误还是故意的。 (好吧,我可能已经发现发生了什么。要对 D 编译器说你的源代码使用 utf-8,你必须在开头放置一些愚蠢的字节顺序标记。我写得很愚蠢,因为我知道没有编辑器这样做,特别是对于 UTF- 8 应该与 ASCII 兼容)。

The question is asked as a Length Prefixed Strings (LPS) vs zero terminated strings (SZ) thing, but mostly expose benefits of length prefixed strings. That may seem overwhelming, but to be honest we should also consider drawbacks of LPS and advantages of SZ.

As I understand it, the question may even be understood as a biased way to ask "what are the advantages of Zero Terminated Strings ?".

Advantages (I see) of Zero Terminated Strings:

  • very simple, no need to introduce new concepts in language, char
    arrays/char pointers can do.
  • the core language just include minimal syntaxic sugar to convert
    something between double quotes to a
    bunch of chars (really a bunch of
    bytes). In some cases it can be used
    to initialize things completely
    unrelated with text. For instance xpm
    image file format is a valid C source
    that contains image data encoded as a
    string.
  • by the way, you can put a zero in a string literal, the compiler will
    just also add another one at the end of the literal: "this\0is\0valid\0C".
    Is it a string ? or four strings ? Or a bunch of bytes...
  • flat implementation, no hidden indirection, no hidden integer.
  • no hidden memory allocation involved (well, some infamous non
    standard functions like strdup
    perform allocation, but that's mostly
    a source of problem).
  • no specific issue for small or large hardware (imagine the burden to
    manage 32 bits prefix length on 8
    bits microcontrollers, or the
    restrictions of limiting string size
    to less than 256 bytes, that was a problem I actually had with Turbo Pascal eons ago).
  • implementation of string manipulation is just a handful of
    very simple library function
  • efficient for the main use of strings : constant text read
    sequentially from a known start
    (mostly messages to the user).
  • the terminating zero is not even mandatory, all necessary tools
    to manipulate chars like a bunch of
    bytes are available. When performing
    array initialisation in C, you can
    even avoid the NUL terminator. Just
    set the right size. char a[3] =
    "foo";
    is valid C (not C++) and
    won't put a final zero in a.
  • coherent with the unix point of view "everything is file", including
    "files" that have no intrinsic length
    like stdin, stdout. You should remember that open read and write primitives are implemented
    at a very low level. They are not library calls, but system calls. And the same API is used
    for binary or text files. File reading primitives get a buffer address and a size and return
    the new size. And you can use strings as the buffer to write. Using another kind of string
    representation would imply you can't easily use a literal string as the buffer to output, or
    you would have to make it have a very strange behavior when casting it to char*. Namely
    not to return the address of the string, but instead to return the actual data.
  • very easy to manipulate text data read from a file in-place, without useless copy of buffer,
    just insert zeroes at the right places (well, not really with modern C as double quoted strings are const char arrays nowaday usually kept in non modifiable data segment).
  • prepending some int values of whatever size would implies alignment issues. The initial
    length should be aligned, but there is no reason to do that for the characters datas (and
    again, forcing alignment of strings would imply problems when treating them as a bunch of
    bytes).
  • length is known at compile time for constant literal strings (sizeof). So why would
    anyone want to store it in memory prepending it to actual data ?
  • in a way C is doing as (nearly) everyone else, strings are viewed as arrays of char. As array length is not managed by C, it is logical length is not managed either for strings. The only surprising thing is that 0 item added at the end, but that's just at core language level when typing a string between double quotes. Users can perfectly call string manipulation functions passing length, or even use plain memcopy instead. SZ are just a facility. In most other languages array length is managed, it's logical that is the same for strings.
  • in modern times anyway 1 byte character sets are not enough and you often have to deal with encoded unicode strings where the number of characters is very different of the number of bytes. It implies that users will probably want more than "just the size", but also other informations. Keeping length give use nothing (particularly no natural place to store them) regarding these other useful pieces of information.

That said, no need to complain in the rare case where standard C strings are indeed inefficient. Libs are available. If I followed that trend, I should complain that standard C does not include any regex support functions... but really everybody knows it's not a real problem as there is libraries available for that purpose. So when string manipulation efficiency is wanted, why not use a library like bstring ? Or even C++ strings ?

EDIT: I recently had a look to D strings. It is interesting enough to see that the solution choosed is neither a size prefix, nor zero termination. As in C, literal strings enclosed in double quotes are just short hand for immutable char arrays, and the language also has a string keyword meaning that (immutable char array).

But D arrays are much richer than C arrays. In the case of static arrays length is known at run-time so there is no need to store the length. Compiler has it at compile time. In the case of dynamic arrays, length is available but D documentation does not state where it is kept. For all we know, compiler could choose to keep it in some register, or in some variable stored far away from the characters data.

On normal char arrays or non literal strings there is no final zero, hence programmer has to put it itself if he wants to call some C function from D. In the particular case of literal strings, however the D compiler still put a zero at the end of each strings (to allow easy cast to C strings to make easier calling C function ?), but this zero is not part of the string (D does not count it in string size).

The only thing that disappointed me somewhat is that strings are supposed to be utf-8, but length apparently still returns a number of bytes (at least it's true on my compiler gdc) even when using multi-byte chars. It is unclear to me if it's a compiler bug or by purpose. (OK, I probably have found out what happened. To say to D compiler your source use utf-8 you have to put some stupid byte order mark at beginning. I write stupid because I know of not editor doing that, especially for UTF-8 that is supposed to be ASCII compatible).

昵称有卵用 2024-10-14 07:59:09

我认为,它有历史原因,并在维基百科中找到了这个

当时的 C(以及当时的语言)
它源自)开发,
内存极其有限,因此使用
只需一个字节的开销来存储
绳子的长度很有吸引力。这
当时唯一流行的选择,
通常称为“帕斯卡弦”
(虽然也被早期版本使用
BASIC),使用前导字节来存储
字符串的长度。这允许
包含 NUL 的字符串并制作
求长度只需要一个
内存访问(O(1)(常数)时间)。
但一个字节限制长度为255。
这个长度限制要大得多
比问题更具有限制性
C字符串,所以一般的C字符串
获胜了。

I think, it has historical reasons and found this in wikipedia:

At the time C (and the languages that
it was derived from) were developed,
memory was extremely limited, so using
only one byte of overhead to store the
length of a string was attractive. The
only popular alternative at that time,
usually called a "Pascal string"
(though also used by early versions of
BASIC), used a leading byte to store
the length of the string. This allows
the string to contain NUL and made
finding the length need only one
memory access (O(1) (constant) time).
But one byte limits the length to 255.
This length limitation was far more
restrictive than the problems with the
C string, so the C string in general
won out.

愛上了 2024-10-14 07:59:09

卡拉维拉正确,但由于人们似乎不明白他的观点,我将提供一些代码示例。

首先,让我们考虑一下 C 是什么:一种简单的语言,所有代码都可以直接翻译成机器语言。所有类型都适合寄存器和堆栈,并且它不需要操作系统或大型运行时库来运行,因为它的目的是写入这些东西(任务是非常适合,考虑到今天甚至还没有可能的竞争对手)。

如果 C 有一个 string 类型,例如 intchar,那么它将是一种不适合寄存器或堆栈的类型,并且需要以任何方式处理内存分配(及其所有支持基础设施)。所有这些都违背了 C 的基本原则。

所以,C 中的字符串是:

char s*;

那么,让我们假设这是有长度前缀的。让我们编写代码来连接两个字符串:

char* concat(char* s1, char* s2)
{
    /* What? What is the type of the length of the string? */
    int l1 = *(int*) s1;
    /* How much? How much must I skip? */
    char *s1s = s1 + sizeof(int);
    int l2 = *(int*) s2;
    char *s2s = s2 + sizeof(int);
    int l3 = l1 + l2;
    char *s3 = (char*) malloc(l3 + sizeof(int));
    char *s3s = s3 + sizeof(int);
    memcpy(s3s, s1s, l1);
    memcpy(s3s + l1, s2s, l2);
    *(int*) s3 = l3;
    return s3;
}

另一种选择是使用结构体来定义字符串:

struct {
  int len; /* cannot be left implementation-defined */
  char* buf;
}

此时,所有字符串操作都需要进行两次分配,这在实践中意味着您将通过一个库来对其进行任何处理。

有趣的是...C 中确实存在这样的结构!它们只是不用于向用户处理日常显示消息。

所以,这就是 Calavera 的观点:C 中没有字符串类型。要使用它做任何事情,您必须获取一个指针并将其解码为指向两种不同类型的指针,然后它变得与字符串的大小非常相关,而不能仅仅保留为“实现定义”。

现在,C 可以以任何方式处理内存,并且库中的 mem 函数(甚至在 中!)提供将内存作为一对指针和大小来处理所需的所有工具。 C 中所谓的“字符串”只是为了一个目的而创建的:在编写用于文本终端的操作系统的上下文中显示消息。并且,为此,空终止就足够了。

Calavera is right, but as people don't seem to get his point, I'll provide some code examples.

First, let's consider what C is: a simple language, where all code has a pretty direct translation into machine language. All types fit into registers and on the stack, and it doesn't require an operating system or a big run-time library to run, since it were meant to write these things (a task to which is superbly well-suited, considering there isn't even a likely competitor to this day).

If C had a string type, like int or char, it would be a type which didn't fit in a register or in the stack, and would require memory allocation (with all its supporting infrastructure) to be handled in any way. All of which go against the basic tenets of C.

So, a string in C is:

char s*;

So, let's assume then that this were length-prefixed. Let's write the code to concatenate two strings:

char* concat(char* s1, char* s2)
{
    /* What? What is the type of the length of the string? */
    int l1 = *(int*) s1;
    /* How much? How much must I skip? */
    char *s1s = s1 + sizeof(int);
    int l2 = *(int*) s2;
    char *s2s = s2 + sizeof(int);
    int l3 = l1 + l2;
    char *s3 = (char*) malloc(l3 + sizeof(int));
    char *s3s = s3 + sizeof(int);
    memcpy(s3s, s1s, l1);
    memcpy(s3s + l1, s2s, l2);
    *(int*) s3 = l3;
    return s3;
}

Another alternative would be using a struct to define a string:

struct {
  int len; /* cannot be left implementation-defined */
  char* buf;
}

At this point, all string manipulation would require two allocations to be made, which, in practice, means you'd go through a library to do any handling of it.

The funny thing is... structs like that do exist in C! They are just not used for your day-to-day displaying messages to the user handling.

So, here is the point Calavera is making: there is no string type in C. To do anything with it, you'd have to take a pointer and decode it as a pointer to two different types, and then it becomes very relevant what is the size of a string, and cannot just be left as "implementation defined".

Now, C can handle memory in anyway, and the mem functions in the library (in <string.h>, even!) provide all the tooling you need to handle memory as a pair of pointer and size. The so-called "strings" in C were created for just one purpose: showing messages in the context of writting an operating system intended for text terminals. And, for that, null termination is enough.

雪落纷纷 2024-10-14 07:59:09

显然,为了性能和安全性,您需要在使用字符串时保留字符串的长度,而不是重复执行 strlen 或等效操作。然而,将长度存储在字符串内容之前的固定位置是一个非常糟糕的设计。正如 Jörgen 在 Sanjit 答案的评论中指出的那样,它不允许将字符串的尾部视为字符串,例如,这会产生许多常见操作,例如 path_to_filenamefilename_to_extension如果不分配新内存(并可能导致失败和错误处理),这是不可能的。当然,还有一个问题是,没有人能就字符串长度字段应占用多少字节达成一致(大量糟糕的“Pascal 字符串”语言使用 16 位字段甚至 24 位字段,这会妨碍长字符串的处理)。

C 让程序员选择是否/在哪里/如何存储长度的设计更加灵活和强大。当然,程序员必须很聪明。 C 通过崩溃、停止或让你的敌人 root 的程序来惩罚愚蠢的行为。

Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As Jörgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).

C's design of letting the programmer choose if/where/how to store the length is much more flexible and powerful. But of course the programmer has to be smart. C punishes stupidity with programs that crash, grind to a halt, or give your enemies root.

谁把谁当真 2024-10-14 07:59:09

考虑到任何语言的汇编语言的惰性、寄存器节俭和可移植性,尤其是 C,它比汇编高出一步(因此继承了大量汇编遗留代码)。
您会同意,因为 null char 在 ASCII 时代毫无用处,它(可能与 EOF 控制 char 一样好)。

让我们看看伪代码

function readString(string) // 1 parameter: 1 register or 1 stact entries
    pointer=addressOf(string) 
    while(string[pointer]!=CONTROL_CHAR) do
        read(string[pointer])
        increment pointer

总共 1 个寄存器用

例 2

 function readString(length,string) // 2 parameters: 2 register used or 2 stack entries
     pointer=addressOf(string) 
     while(length>0) do 
         read(string[pointer])
         increment pointer
         decrement length

总共 2 个寄存器使用

当时这可能看起来短视,但考虑到代码和寄存器的节俭(当时是高级的,当你知道的时候,他们使用打孔卡) 。因此,速度更快(当处理器速度可以以 kHz 为单位计算时),这个“Hack”非常好,并且可以轻松移植到无寄存器处理器。

为了论证,我将实现 2 个常见的字符串操作,

stringLength(string)
     pointer=addressOf(string)
     while(string[pointer]!=CONTROL_CHAR) do
         increment pointer
     return pointer-addressOf(string)

复杂度为 O(n),其中在大多数情况下 PASCAL 字符串为 O(1),因为字符串的长度被预先添加到字符串结构中(这也意味着此操作将具有提前进行)。

concatString(string1,string2)
     length1=stringLength(string1)
     length2=stringLength(string2)
     string3=allocate(string1+string2)
     pointer1=addressOf(string1)
     pointer3=addressOf(string3)
     while(string1[pointer1]!=CONTROL_CHAR) do
         string3[pointer3]=string1[pointer1]
         increment pointer3
         increment pointer1
     pointer2=addressOf(string2)
     while(string2[pointer2]!=CONTROL_CHAR) do
         string3[pointer3]=string2[pointer2]
         increment pointer3
         increment pointer1
     return string3

复杂性 O(n) 和前置字符串长度不会改变操作的复杂性,但我承认这会减少 3 倍的时间。

另一方面,如果您使用 PASCAL 字符串,则必须重新设计 API 以考虑寄存器长度和位字节顺序,PASCAL 字符串具有 255 个字符 (0xFF) 的众所周知的限制,因为长度存储在 1 个字节(8 位)中),如果您想要更长的字符串(16 位 -> 任何值),则必须考虑代码一层中的架构,这意味着如果您想要更长的字符串,在大多数情况下,字符串 API 不兼容。

示例:

一个文件是在 8 位计算机上使用前置字符串 api 编写的,然后必须在 32 位计算机上读取,惰性程序会做什么考虑您的 4 个字节是字符串的长度,然后分配该批次然后尝试读取那么多字节的内存。
另一种情况是 PPC 32 字节字符串读取(小端)到 x86(大端),当然,如果您不知道一个是由另一个写入的,则会出现麻烦。
1 字节长度 (0x00000001) 将变为 16777216 (0x0100000),即读取 1 字节字符串时需要 16 MB。
当然,你会说人们应该就一个标准达成一致,但即使是 16 位 unicode 也有小端和大端。

当然,C 也会有它的问题,但是受这里提出的问题影响很小。

Lazyness, register frugality and portability considering the assembly gut of any language, especially C which is one step above assembly (thus inheriting a lot of assembly legacy code).
You would agree as a null char would be useless in those ASCII days, it (and probably as good as an EOF control char ).

let's see in pseudo code

function readString(string) // 1 parameter: 1 register or 1 stact entries
    pointer=addressOf(string) 
    while(string[pointer]!=CONTROL_CHAR) do
        read(string[pointer])
        increment pointer

total 1 register use

case 2

 function readString(length,string) // 2 parameters: 2 register used or 2 stack entries
     pointer=addressOf(string) 
     while(length>0) do 
         read(string[pointer])
         increment pointer
         decrement length

total 2 register used

That might seem shortsighted at that time, but considering the frugality in code and register ( which were PREMIUM at that time, the time when you know, they use punch card ). Thus being faster ( when processor speed could be counted in kHz), this "Hack" was pretty darn good and portable to register-less processor with ease.

For argument sake I will implement 2 common string operation

stringLength(string)
     pointer=addressOf(string)
     while(string[pointer]!=CONTROL_CHAR) do
         increment pointer
     return pointer-addressOf(string)

complexity O(n) where in most case PASCAL string is O(1) because the length of the string is pre-pended to the string structure (that would also mean that this operation would have to be carried in an earlier stage).

concatString(string1,string2)
     length1=stringLength(string1)
     length2=stringLength(string2)
     string3=allocate(string1+string2)
     pointer1=addressOf(string1)
     pointer3=addressOf(string3)
     while(string1[pointer1]!=CONTROL_CHAR) do
         string3[pointer3]=string1[pointer1]
         increment pointer3
         increment pointer1
     pointer2=addressOf(string2)
     while(string2[pointer2]!=CONTROL_CHAR) do
         string3[pointer3]=string2[pointer2]
         increment pointer3
         increment pointer1
     return string3

complexity O(n) and prepending the string length wouldn't change the complexity of the operation, while I admit it would take 3 time less time.

On another hand, if you use PASCAL string you would have to redesign your API for taking in account register length and bit-endianness, PASCAL string got the well known limitation of 255 char (0xFF) beacause the length was stored in 1 byte (8bits), and it you wanted a longer string (16bits->anything) you would have to take in account the architecture in one layer of your code, that would mean in most case incompatible string APIs if you wanted longer string.

Example:

One file was written with your prepended string api on an 8 bit computer and then would have to be read on say a 32 bit computer, what would the lazy program do considers that your 4bytes are the length of the string then allocate that lot of memory then attempt to read that many bytes.
Another case would be PPC 32 byte string read(little endian) onto a x86 (big endian), of course if you don't know that one is written by the other there would be trouble.
1 byte length (0x00000001) would become 16777216 (0x0100000) that is 16 MB for reading a 1 byte string.
Of course you would say that people should agree on one standard but even 16bit unicode got little and big endianness.

Of course C would have its issues too but, would be very little affected by the issues raised here.

我三岁 2024-10-14 07:59:09

在很多方面,C 都是原始的。我喜欢它。

它比汇编语言高出一步,使用更容易编写和维护的语言为您提供几乎相同的性能。

空终止符很简单,不需要语言的特殊支持。

回想起来,好像不太方便。但我在 80 年代就使用过汇编语言,当时看起来非常方便。我只是认为软件在不断发展,平台和工具也在不断变得越来越复杂。

In many ways, C was primitive. And I loved it.

It was a step above assembly language, giving you nearly the same performance with a language that was much easier to write and maintain.

The null terminator is simple and requires no special support by the language.

Looking back, it doesn't seem that convenient. But I used assembly language back in the 80s and it seemed very convenient at the time. I just think software is continually evolving, and the platforms and tools continually get more and more sophisticated.

稚然 2024-10-14 07:59:09

假设 C 以 Pascal 方式实现字符串,通过为字符串添加长度前缀:7 个字符长的字符串与 3 个字符的字符串的数据类型相同吗?如果答案是肯定的,那么当我将前者分配给后者时,编译器应该生成什么样的代码?字符串应该被截断还是自动调整大小?如果调整大小,该操作是否应该受锁保护以使其线程安全?不管你喜欢与否,C 方法回避了所有这些问题:)

Assuming for a moment that C implemented strings the Pascal way, by prefixing them by length: is a 7 char long string the same DATA TYPE as a 3-char string? If the answer is yes, then what kind of code should the compiler generate when I assign the former to the latter? Should the string be truncated, or automatically resized? If resized, should that operation be protected by a lock as to make it thread safe? The C approach side stepped all these issues, like it or not :)

鸠书 2024-10-14 07:59:09

不知何故,我理解这个问题意味着编译器不支持 C 中的长度前缀字符串。下面的示例显示,至少您可以启动自己的 C 字符串库,其中字符串长度在编译时计算,具有如下结构:

#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })

typedef struct { int n; char * p; } prefix_str_t;

int main() {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}

然而,这不会没有任何问题,因为您需要小心何时专门释放该字符串指针以及何时静态分配它(文字 char 数组)。

编辑:作为对问题的更直接的回答,我的观点是,如果您需要它,但仍然没有可用的字符串长度(作为编译时常量),这是 C 可以支持的方式如果您只想使用指针和零终止,则会产生内存开销。

当然,似乎使用零终止字符串是推荐的做法,因为标准库通常不将字符串长度作为参数,并且提取长度并不像 char * s = 那样简单。 “abc”,如我的示例所示。

Somehow I understood the question to imply there's no compiler support for length-prefixed strings in C. The following example shows, at least you can start your own C string library, where string lengths are counted at compile time, with a construct like this:

#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })

typedef struct { int n; char * p; } prefix_str_t;

int main() {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}

This won't, however, come with no issues as you need to be careful when to specifically free that string pointer and when it is statically allocated (literal char array).

Edit: As a more direct answer to the question, my view is this was the way C could support both having string length available (as a compile time constant), should you need it, but still with no memory overhead if you want to use only pointers and zero termination.

Of course it seems like working with zero-terminated strings was the recommended practice, since the standard library in general doesn't take string lengths as arguments, and since extracting the length isn't as straightforward code as char * s = "abc", as my example shows.

琴流音 2024-10-14 07:59:09

“即使在 32 位计算机上,如果允许字符串为可用内存的大小,则长度前缀字符串仅比空终止字符串宽三个字节。”

首先,对于短字符串来说,额外的 3 个字节可能是相当大的开销。特别是,零长度字符串现在占用 4 倍的内存。我们中的一些人使用的是 64 位机器,因此我们要么需要 8 个字节来存储零长度字符串,要么字符串格式无法应对平台支持的最长字符串。

可能还需要处理对齐问题。假设我有一块内存,包含 7 个字符串,例如“solo\0second\0\0four\0 Five\0\0seventh”。第二个字符串从偏移量 5 开始。硬件可能要求 32 位整数在 4 的倍数地址处对齐,因此您必须添加填充,从而进一步增加开销。相比之下,C 表示的内存效率非常高。 (内存效率很好;例如,它有助于缓存性能。)

"Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string."

First, extra 3 bytes may be considerable overhead for short strings. In particular, a zero-length string now takes 4 times as much memory. Some of us are using 64-bit machines, so we either need 8 bytes to store a zero-length string, or the string format can't cope with the longest strings the platform supports.

There may also be alignment issues to deal with. Suppose I have a block of memory containing 7 strings, like "solo\0second\0\0four\0five\0\0seventh". The second string starts at offset 5. The hardware may require that 32-bit integers be aligned at an address that is a multiple of 4, so you have to add padding, increasing the overhead even further. The C representation is very memory-efficient in comparison. (Memory-efficiency is good; it helps cache performance, for example.)

请帮我爱他 2024-10-14 07:59:09

有一点尚未提及:当设计 C 语言时,有许多机器的“char”不是八位(即使在今天也有 DSP 平台不是八位)。如果决定字符串要加上长度前缀,那么应该使用多少个“字符”的长度前缀?使用两个会对具有 8 位字符和 32 位寻址空间的机器施加人为的字符串长度限制,而在具有 16 位字符和 16 位寻址空间的机器上浪费空间。

如果想要有效地存储任意长度的字符串,并且“char”始终为 8 位,则可以(在速度和代码大小方面付出一些代价)定义一种方案,即以偶数为前缀的字符串N 的长度为 N/2 个字节,以奇数 N 和偶数 M(向后读)为前缀的字符串可以是 ((N-1) + M*char_max)/2 等,并且要求任何缓冲区声称提供一定量的空间来容纳字符串必须在该空间之前留出足够的字节来处理最大长度。然而,“char”并不总是 8 位这一事实会使这种方案变得复杂,因为保存字符串长度所需的“char”数量会根据 CPU 架构而变化。

One point not yet mentioned: when C was designed, there were many machines where a 'char' was not eight bits (even today there are DSP platforms where it isn't). If one decides that strings are to be length-prefixed, how many 'char's worth of length prefix should one use? Using two would impose an artificial limit on string length for machines with 8-bit char and 32-bit addressing space, while wasting space on machines with 16-bit char and 16-bit addressing space.

If one wanted to allow arbitrary-length strings to be stored efficiently, and if 'char' were always 8-bits, one could--for some expense in speed and code size--define a scheme were a string prefixed by an even number N would be N/2 bytes long, a string prefixed by an odd value N and an even value M (reading backward) could be ((N-1) + M*char_max)/2, etc. and require that any buffer which claims to offer a certain amount of space to hold a string must allow enough bytes preceding that space to handle the maximum length. The fact that 'char' isn't always 8 bits, however, would complicate such a scheme, since the number of 'char' required to hold a string's length would vary depending upon the CPU architecture.

千纸鹤 2024-10-14 07:59:09

空终止允许基于快速指针的操作。

The null termination allows for fast pointer based operations.

桃扇骨 2024-10-14 07:59:09

不一定是基本原理,而是长度编码的对立

  1. 就内存而言,某些形式的动态长度编码优于静态长度编码,这完全取决于关于使用。只要看看 UTF-8 就可以得到证明。它本质上是一个用于编码单个字符的可扩展字符数组。这对每个扩展字节使用一位。 NUL 终止使用 8 位。我认为使用 64 位也可以合理地将长度前缀称为无限长度。决定性因素是你多频繁地碰到你的额外位。只有 1 个非常大的字符串?谁在乎你使用的是 8 位还是 64 位?许多小字符串(即英语单词字符串)?那么你的前缀成本会占很大比例。

  2. 允许节省时间的长度前缀字符串不是真实的事情。无论您提供的数据是否需要提供长度,您都是在编译时进行计数,还是确实提供了必须编码为字符串的动态数据。这些大小是在算法中的某个时刻计算的。可以提供一个单独的变量来存储空终止字符串的大小。这使得节省时间的比较毫无意义。一个只是在末尾有一个额外的 NUL...但如果长度编码不包含该 NUL,那么两者之间实际上没有区别。根本不需要改变算法。只是一个预传递,您必须自己手动设计,而不是让编译器/运行时为您完成。 C 主要是关于手动执行操作。

  3. 长度前缀可选是一个卖点。我并不总是需要算法的额外信息,因此需要对每个字符串执行此操作使得我的预计算+计算时间永远无法低于 O(n)。 (即硬件随机数生成器 1-128。我可以从“无限字符串”中提取。假设它只能如此快地生成字符。所以我们的字符串长度一直在变化。但我对数据的使用可能并不关心如何我有很多随机字节。它只需要在请求后立即获取下一个可用的未使用字节,但我也可以预先读取字符缓冲区。不必要的计算浪费。空检查更有效。)

  4. 长度前缀可以很好地防止缓冲区溢出?库函数和实现的合理使用也是如此。如果我传入格式错误的数据怎么办?我的缓冲区有 2 个字节长,但我告诉函数它是 7 个字节! 示例:如果gets()旨在用于已知数据,则可以进行内部缓冲区检查来测试编译缓冲区和malloc() em> 调用并仍然遵循规范。如果它的目的是用作未知 STDIN 到达未知缓冲区的管道,那么显然人们无法知道缓冲区大小,这意味着长度 arg 是毫无意义的,您在这里需要其他东西,例如金丝雀检查。就此而言,您不能为某些流和输入添加长度前缀,您就是不能。这意味着长度检查必须内置到算法中,而不是打字系统的神奇部分。 TL;DR NUL 终止从来都不是不安全的,它只是由于误用而导致这种情况。

  5. 反反点: NUL 终止在二进制文件中很烦人。您要么需要在这里进行长度前缀,要么以某种方式转换 NUL 字节:转义码、范围重新映射等……这当然意味着更多的内存使用/减少的信息/每字节更多的操作。长度前缀在这里基本上赢得了战争。转换的唯一优点是不需要编写额外的函数来覆盖长度前缀字符串。这意味着在更优化的子 O(n) 例程中,您可以让它们自动充当其 O(n) 等效例程,而无需添加更多代码。当然,缺点是在 NUL 重字符串上使用时会浪费时间/内存/压缩。 根据您最终复制的库的多少来操作二进制数据,仅使用长度前缀字符串可能是有意义的。也就是说,也可以对长度前缀字符串执行相同的操作...-1 长度可能意味着 NUL 终止,并且您可以在长度终止内使用 NUL 终止字符串。

  6. Concat:" O(n+m) vs O(m)" 我假设您将 m 称为连接后字符串的总长度,因为它们都必须具有最少的操作数(您不能只附加到字符串 1,如果必须重新分配怎么办?)。我假设 n 是由于预计算而不再需要执行的神话般的操作量。如果是这样,那么答案很简单:预先计算。 如果你坚持认为你总是有足够的内存而不需要重新分配,这就是大O表示法的基础,那么答案就更简单了:对分配的内存进行二分搜索以获取结束对于字符串 1,显然在字符串 1 之后有一大堆无限零,因此我们不必担心重新分配。在那里,很容易得到 n 到 log(n),我几乎没有尝试过。如果您还记得,在真实计算机上 log(n) 本质上只有 64,这本质上就像说 O(64+m),本质上是 O(m)。 (是的,该逻辑已用于当今使用的真实数据结构的运行时分析。这不是我凭空想象出来的废话。)

  7. Concat()/Len() 再次:记住结果。简单的。如果可能/必要,将所有计算转变为预计算。这是一个算法决定。这不是语言的强制约束。

  8. 使用 NUL 终止,字符串后缀传递更容易/可能。根据长度前缀的实现方式,它可能会破坏原始字符串,有时甚至是不可能的。需要复制并传递 O(n) 而不是 O(1)。

  9. 与长度前缀相比,NUL 终止的参数传递/取消引用更少。显然是因为您传递的信息较少。如果您不需要长度,那么这可以节省大量内存并允许优化。

  10. 你可以作弊。它实际上只是一个指针。谁说必须将其作为字符串来读取?如果您想将其读取为单个字符或浮点数怎么办?如果您想做相反的事情并将浮点数读取为字符串怎么办?如果你小心的话,你可以用 NUL 终止来做到这一点。您不能使用长度前缀来执行此操作,它是一种与通常的指针明显不同的数据类型。您很可能必须逐字节构建一个字符串并获取长度。当然,如果您想要像整个浮点数(其中可能有一个NUL)之类的东西,您无论如何都必须逐字节读取,但细节由您决定。

TL;DR 您使用的是二进制数据吗?如果不是,则 NUL 终止允许更多的算法自由。如果是,那么代码数量与速度/内存/压缩是您主要关心的问题。两种方法的混合或记忆可能是最好的。

Not a Rationale necessarily but a counterpoint to length-encoded

  1. Certain forms of dynamic length encoding are superior to static length encoding as far as memory is concerned, it all depends on usage. Just look at UTF-8 for proof. It's essentially an extensible character array for encoding a single character. This uses a single bit for each extended byte. NUL termination uses 8 bits. Length-prefix I think can be reasonably termed infinite length as well by using 64 bits. How often you hit the case of your extra bits is the deciding factor. Only 1 extremely large string? Who cares if you're using 8 or 64 bits? Many small strings (Ie Strings of English words)? Then your prefix costs are a large percentage.

  2. Length-prefixed strings allowing time savings is not a real thing. Whether your supplied data is required to have length provided, you're counting at compile time, or you're truly being provided dynamic data that you must encode as a string. These sizes are computed at some point in the algorithm. A separate variable to store the size of a null terminated string can be provided. Which makes the comparison on time-savings moot. One just has an extra NUL at the end... but if the length encode doesn't include that NUL then there's literally no difference between the two. There's no algorithmic change required at all. Just a pre-pass you have to manually design yourself instead of having a compiler/runtime do it for you. C is mostly about doing things manually.

  3. Length-prefix being optional is a selling point. I don't always need that extra info for an algorithm so being required to do it for a every string makes my precompute+compute time never able to drop below O(n). (Ie hardware random number generator 1-128. I can pull from an "infinite string". Let's say it only generates characters so fast. So our string length changes all the time. But my usage of the data probably doesn't care how many random bytes I have. It just wants the next available unused byte as soon as it can get it after a request. I could be waiting on the device. But I could also have a buffer of characters pre-read. A length comparison is a needless waste of computation. A null check is more efficient.)

  4. Length-prefix is a good guard against buffer overflow? So is sane usage of library functions and implementation. What if I pass in malformed data? My buffer is 2 bytes long but I tell the function it's 7! Ex: If gets() was intended to be used on known data it could've had an internal buffer check that tested compiled buffers and malloc() calls and still follow spec. If it was meant to be used as a pipe for unknown STDIN to arrive at unknown buffer then clearly one can't know abut the buffer size which means a length arg is pointless, you need something else here like a canary check. For that matter, you can't length-prefix some streams and inputs, you just can't. Which means the length check has to be built into the algorithm and not a magic part of the typing system. TL;DR NUL-terminated never had to be unsafe, it just ended up that way via misuse.

  5. counter-counter point: NUL-termination is annoying on binary. You either need to do length-prefix here or transform NUL bytes in some way: escape-codes, range remapping, etc... which of course means more-memory-usage/reduced-information/more-operations-per-byte. Length-prefix mostly wins the war here. The only upside to a transform is that no additional functions have to be written to cover the length-prefix strings. Which means on your more optimized sub-O(n) routines you can have them automatically act as their O(n) equivalents without adding more code. Downside is, of course, time/memory/compression waste when used on NUL heavy strings. Depending on how much of your library you end up duplicating to operate on binary data, it may make sense to work solely with length-prefix strings. That said one could also do the same with length-prefix strings... -1 length could mean NUL-terminated and you could use NUL-terminated strings inside length-terminated.

  6. Concat: "O(n+m) vs O(m)" I'm assuming your referring to m as the total length of the string after concatenating because they both have to have that number of operations minimum (you can't just tack-on to string 1, what if you have to realloc?). And I'm assuming n is a mythical amount of operations you no longer have to do because of a pre-compute. If so, then the answer is simple: pre-compute. If you're insisting you'll always have enough memory to not need to realloc and that's the basis of the big-O notation then the answer is even more simple: do binary search on allocated memory for end of string 1, clearly there's a large swatch of infinite zeros after string 1 for us to not worry about realloc. There, easily got n to log(n) and I barely tried. Which if you recall log(n) is essentially only ever as large as 64 on a real computer, which is essentially like saying O(64+m), which is essentially O(m). (And yes that logic has been used in run-time analysis of real data structures in-use today. It's not bullshit off the top of my head.)

  7. Concat()/Len() again: Memoize results. Easy. Turns all computes into pre-computes if possible/necessary. This is an algorithmic decision. It's not an enforced constraint of the language.

  8. String suffix passing is easier/possible with NUL termination. Depending on how length-prefix is implemented it can be destructive on original string and can sometimes not even be possible. Requiring a copy and pass O(n) instead of O(1).

  9. Argument-passing/de-referencing is less for NUL-terminated versus length-prefix. Obviously because you're passing less information. If you don't need length, then this saves a lot of footprint and allows optimizations.

  10. You can cheat. It's really just a pointer. Who says you have to read it as a string? What if you want to read it as a single character or a float? What if you want to do the opposite and read a float as a string? If you're careful you can do this with NUL-termination. You can't do this with length-prefix, it's a data type distinctly different from a pointer typically. You'd most likely have to build a string byte-by-byte and get the length. Of course if you wanted something like an entire float (probably has a NUL inside it) you'd have to read byte-by-byte anyway, but the details are left to you to decide.

TL;DR Are you using binary data? If no, then NUL-termination allows more algorithmic freedom. If yes, then code quantity vs speed/memory/compression is your main concern. A blend of the two approaches or memoization might be best.

南街女流氓 2024-10-14 07:59:09

许多围绕 C 的设计决策都源于这样一个事实:在最初实现它时,参数传递的成本有些昂贵。如果在例如

void add_element_to_next(arr, offset)
  char[] arr;
  int offset;
{
  arr[offset] += arr[offset+1];
}

char array[40];

void test()
{
  for (i=0; i<39; i++)
    add_element_to_next(array, i);
}

void add_element_to_next(ptr)
  char *p;
{
  p[0]+=p[1];
}

char array[40];

void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(arr+i);
}

后者之间进行选择,会稍微便宜一些(因此是首选),因为它只需要传递一个参数而不是两个。如果被调用的方法不需要知道数组的基地址或数组中的索引,则传递将两者结合起来的单个指针比单独传递值更便宜。

虽然 C 可以通过许多合理的方式对字符串长度进行编码,但迄今为止发明的方法将具有所有必需的函数,这些函数应该能够处理字符串的一部分以接受字符串的基地址,并且所需的索引作为两个单独的参数。使用零字节终止可以避免该要求。尽管其他方法对于当今的机器来说会更好(现代编译器通常在寄存器中传递参数,并且 memcpy 可以以 strcpy() 等价物不能的方式进行优化),但足够的生产代码使用零字节终止字符串,因此很难更改为其他任何内容。

PS--作为对某些操作的轻微速度损失以及较长字符串的一点点额外开销的交换,可以让处理字符串的方法直接接受指向字符串的指针,边界检查< /em> 字符串缓冲区,或标识另一个字符串的子字符串的数据结构。像“strcat”这样的函数看起来像[现代语法],

void strcat(unsigned char *dest, unsigned char *src)
{
  struct STRING_INFO d,s;
  str_size_t copy_length;

  get_string_info(&d, dest);
  get_string_info(&s, src);
  if (d.si_buff_size > d.si_length) // Destination is resizable buffer
  {
    copy_length = d.si_buff_size - d.si_length;
    if (s.src_length < copy_length)
      copy_length = s.src_length;
    memcpy(d.buff + d.si_length, s.buff, copy_length);
    d.si_length += copy_length;
    update_string_length(&d);
  }
}

比 K&R strcat 方法稍大一些,但它支持边界检查,而 K&R 方法则不支持。此外,与当前方法不同,可以轻松连接任意子字符串,例如

/* Concatenate 10th through 24th characters from src to dest */

void catpart(unsigned char *dest, unsigned char *src)
{
  struct SUBSTRING_INFO *inf;
  src = temp_substring(&inf, src, 10, 24);
  strcat(dest, src);
}

请注意,temp_substring 返回的字符串的生命周期将受到 ssrc,哪个更短(这就是为什么该方法需要传入 inf ——如果它是本地的,那么当方法返回时它就会消失)。

就内存成本而言,最大 64 字节的字符串和缓冲区将有 1 字节的开销(与以零结尾的字符串相同);较长的字符串会稍微多一些(是否允许两个字节之间的开销和所需的最大开销将是时间/空间的权衡)。长度/模式字节的特殊值将用于指示字符串函数被赋予包含标志字节、指针和缓冲区长度的结构(然后可以任意索引到任何其他字符串)。

当然,K&R 没有实现任何这样的东西,但这很可能是因为他们不想在字符串处理上花费太多精力——即使在今天,许多语言在这个领域似乎也相当贫乏。

Many design decisions surrounding C stem from the fact that when it was originally implemented, parameter passing was somewhat expensive. Given a choice between e.g.

void add_element_to_next(arr, offset)
  char[] arr;
  int offset;
{
  arr[offset] += arr[offset+1];
}

char array[40];

void test()
{
  for (i=0; i<39; i++)
    add_element_to_next(array, i);
}

versus

void add_element_to_next(ptr)
  char *p;
{
  p[0]+=p[1];
}

char array[40];

void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(arr+i);
}

the latter would have been slightly cheaper (and thus preferred) since it only required passing one parameter rather than two. If the method being called didn't need to know the base address of the array nor the index within it, passing a single pointer combining the two would be cheaper than passing the values separately.

While there are many reasonable ways in which C could have encoded string lengths, the approaches that had been invented up to that time would have all required functions that should be able to work with part of a string to accept the base address of the string and the desired index as two separate parameters. Using zero-byte termination made it possible to avoid that requirement. Although other approaches would be better with today's machines (modern compilers often pass parameters in registers, and memcpy can be optimized in ways strcpy()-equivalents cannot) enough production code uses zero-byte terminated strings that it's hard to change to anything else.

PS--In exchange for a slight speed penalty on some operations, and a tiny bit of extra overhead on longer strings, it would have been possible to have methods that work with strings accept pointers directly to strings, bounds-checked string buffers, or data structures identifying substrings of another string. A function like "strcat" would have looked something like [modern syntax]

void strcat(unsigned char *dest, unsigned char *src)
{
  struct STRING_INFO d,s;
  str_size_t copy_length;

  get_string_info(&d, dest);
  get_string_info(&s, src);
  if (d.si_buff_size > d.si_length) // Destination is resizable buffer
  {
    copy_length = d.si_buff_size - d.si_length;
    if (s.src_length < copy_length)
      copy_length = s.src_length;
    memcpy(d.buff + d.si_length, s.buff, copy_length);
    d.si_length += copy_length;
    update_string_length(&d);
  }
}

A little bigger than the K&R strcat method, but it would support bounds-checking, which the K&R method doesn't. Further, unlike the current method, it would be possible to easily concatenate an arbitrary substring, e.g.

/* Concatenate 10th through 24th characters from src to dest */

void catpart(unsigned char *dest, unsigned char *src)
{
  struct SUBSTRING_INFO *inf;
  src = temp_substring(&inf, src, 10, 24);
  strcat(dest, src);
}

Note that the lifetime of the string returned by temp_substring would be limited by those of s and src, which ever was shorter (which is why the method requires inf to be passed in--if it was local, it would die when the method returned).

In terms of memory cost, strings and buffers up to 64 bytes would have one byte of overhead (same as zero-terminated strings); longer strings would have slightly more (whether one allowed amounts of overhead between two bytes and the maximum required would be a time/space tradeoff). A special value of the length/mode byte would be used to indicate that a string function was given a structure containing a flag byte, a pointer, and a buffer length (which could then index arbitrarily into any other string).

Of course, K&R didn't implement any such thing, but that's most likely because they didn't want to spend much effort on string handling--an area where even today many languages seem rather anemic.

北凤男飞 2024-10-14 07:59:09

根据 Joel Spolsky 在这篇博文中的说法,

这是因为发明了 UNIX 和 C 编程语言的 PDP-7 微处理器具有 ASCIZ 字符串类型。 ASCIZ 的意思是“末尾带有 Z(零)的 ASCII。”

在这里看到所有其他答案后,我确信即使这是真的,这也只是 C 具有以 null 结尾的“字符串”的部分原因。这篇文章非常有启发性地说明了像字符串这样简单的东西实际上是多么困难。

According to Joel Spolsky in this blog post,

It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

After seeing all the other answers here, I'm convinced that even if this is true, it's only part of the reason for C having null-terminated "strings". That post is quite illuminating as to how simple things like strings can actually be quite hard.

吖咩 2024-10-14 07:59:09

我不相信“C 没有字符串”的答案。诚然,C 不支持内置的高级类型,但您仍然可以用 C 表示数据结构,这就是字符串。事实上,字符串只是 C 中的指针,并不意味着前 N 个字节不能作为长度具有特殊含义。

Windows/COM 开发人员会非常熟悉 BSTR 类型,它完全是这样的 - 一个带有长度前缀的 C 字符串,其中实际字符数据不是从字节 0 开始

。似乎使用空终止的决定只是人们的偏好,而不是该语言的必要条件。

I don't buy the "C has no string" answer. True, C does not support built-in higher-level types but you can still represent data-structures in C and that's what a string is. The fact a string is just a pointer in C does not mean the first N bytes cannot take on special meaning as a the length.

Windows/COM developers will be very familiar with the BSTR type which is exactly like this - a length-prefixed C string where the actual character data starts not at byte 0.

So it seems that the decision to use null-termination is simply what people preferred, not a necessity of the language.

黯然#的苍凉 2024-10-14 07:59:09

NUL 终止相对于长度前缀的一个优点是字符串比较的简单性(我没有看到任何人提到过)。考虑比较标准,它返回小于、等于或大于的有符号结果。对于长度前缀,算法必须遵循以下原则:

  1. 比较两个长度;记录较小的值,并注意它们是否相等(最后一步可能会推迟到步骤 3)。
  2. 扫描两个字符序列,减去匹配索引处的字符(或使用双指针扫描)。当差异不为零时停止,返回差异,或者当扫描的字符数等于较小的长度时停止。
  3. 当达到较小的长度时,一个字符串是另一个字符串的前缀。根据较短者返回负值或正值,如果长度相等则返回零。

将此与 NUL 终止算法进行对比:

  1. 扫描两个字符序列,减去匹配索引处的字符[请注意,通过移动指针可以更好地处理此问题]。当差异不为零时停止,返回差异。注意:如果一个字符串是另一个字符串的 PROPER 前缀,则减法中的一个字符将为 NUL,即零,比较自然会在那里停止。
  2. 如果差异为零,则仅检查任一字符是否为 NUL。如果是,则返回零,否则继续下一个字符。

NUL 终止的情况更简单,并且很容易通过双指针扫描有效地实现。带长度前缀的大小写至少能完成同样多的工作,而且几乎总是更多。如果您的算法必须进行大量字符串比较(例如编译器!),则以 NUL 结尾的情况会胜出。现在这可能不那么重要了,但在过去,哎呀,是的。

One advantage of NUL-termination over length-prefixing, which I have not seen anyone mention, is the simplicity of string comparison. Consider the comparison standard which returns a signed result for less-than, equal, or greater-than. For length-prefixing the algorithm has to be something along the following lines:

  1. Compare the two lengths; record the smaller, and note if they are equal (this last step might be deferred to step 3).
  2. Scan the two character sequences, subtracting characters at matching indices (or use a dual pointer scan). Stop either when the difference is nonzero, returning the difference, or when the number of characters scanned is equal to the smaller length.
  3. When the smaller length is reached, one string is a prefix of the other. Return negative or positive value according to which is shorter, or zero if of equal length.

Contrast this with the NUL-termination algorithm:

  1. Scan the two character sequences, subtracting characters at matching indices [note that this is handled better with moving pointers]. Stop when the difference is nonzero, returning the difference. NOTE: If one string is a PROPER prefix of the other, one of the characters in the subtraction will be NUL, i.e zero, and the comparison will naturally stop there.
  2. If the difference is zero, -only then- check if either character is NUL. If so, return zero, otherwise continue to next character.

The NUL-terminated case is simpler, and very easy to implement efficiently with a dual pointer scan. The length-prefixed case does at least as much work, nearly always more. If your algorithm has to do a lot of string comparisons [e.g a compiler!], the NUL-terminated case wins out. Nowadays that might not be as important, but back in the day, heck yeah.

心作怪 2024-10-14 07:59:09

gcc 接受以下代码:

char s[4] = "abcd";

如果我们将 is 视为字符数组而不是字符串,那就可以了。也就是说,我们可以使用 s[0]、s[1]、s[2] 和 s[3] 甚至 memcpy(dest, s, 4) 来访问它。但是当我们尝试使用 put(s) 时,我们会得到混乱的字符,或者更糟糕的是使用 strcpy(dest, s)。

gcc accept the codes below:

char s[4] = "abcd";

and it's ok if we treat is as an array of chars but not string. That is, we can access it with s[0], s[1], s[2], and s[3], or even with memcpy(dest, s, 4). But we'll get messy characters when we trying with puts(s), or worse with strcpy(dest, s).

一直在等你来 2024-10-14 07:59:09

我认为更好的问题是为什么你认为 C 欠你什么? C 的设计就是为了给你你所需要的,仅此而已。你需要摆脱那种认为语言必须为你提供一切的心态。或者继续使用高级语言,它们将为您提供字符串、日历、容器的奢华;就 Java 而言,你得到的东西有无数种。多种类型的String,多种类型的unordered_map(s)。

对你来说太糟糕了,这不是 C 的目的。C 并不是被设计成一种臃肿的语言,提供从固定到锚的功能。相反,您必须依赖第三方库或您自己的库。没有什么比创建一个包含字符串及其大小的简单结构更容易的了。

struct String
{
 const char *s;
 size_t len;
};

不过你知道这有什么问题。这不是标准的。另一种语言可能决定将 len 组织在字符串之前。另一种语言可能决定使用指针来代替。另一个可能决定使用六个指针来提高字符串的效率。然而,空终止字符串是最标准的字符串格式;您可以使用它与任何语言进行交互。甚至 Java JNI 也使用以 null 结尾的字符串。

最后,这是一句俗话;任务的正确数据结构。如果您发现最需要了解字符串的大小;我们将使用允许您以最佳方式做到这一点的字符串结构。但不要声称该操作对每个人来说都比其他操作更常用。比如,为什么知道字符串的大小比读取其内容更重要。我发现读取字符串的内容是我最常做的事情,所以我使用 null 终止的字符串而不是 std::string;这为我在 GCC 编译器上节省了 5 个指针。如果我能保存 2 个指针那就太好了。

I think the better question is why you think C owes you anything? C was designed to give you what you need, nothing more. You need to loose the mentality that the language must provide you with everything. Or just continue to use your higher level languages that will give you the luxary of String, Calendar, Containers; and in the case of Java you get one thing in tonnes of variety. Multiple types String, multiple types of unordered_map(s).

Too bad for you, this was not the purpose of C. C was not designed to be a bloated language that offers from a pin to an anchor. Instead you must rely on third party libraries or your own. And there is nothing easier than creating a simple struct that will contain a string and its size.

struct String
{
 const char *s;
 size_t len;
};

You know what the problem is with this though. It is not standard. Another language might decide to organize the len before the string. Another language might decide to use a pointer to end instead. Another might decide to use six pointers to make the String more efficient. However a null terminated string is the most standard format for a string; which you can use to interface with any language. Even Java JNI uses null terminated strings.

Lastly, it is a common saying; the right data structure for the task. If you find that need to know the size of a string more than anything else; well use a string structure that allows you to do that optimally. But don't make claims that that operation is used more than anything else for everybody. Like, why is knowing the size of a string more important than reading its contents. I find that reading the contents of a string is what I mostly do, so I use null terminated strings instead of std::string; which saves me 5 pointers on a GCC compiler. If I can even save 2 pointers that is good.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文