Why doesn't C use a special escaped character to terminate strings?
In C, strings are terminated with null ( \0 ), which causes problems when you want to put a null in a string. Why not have a special escaped character such as \$ or something?
I am fully aware of how dumb this question is, but I was curious.
Comments (8)
Terminating with a 0 has many performance niceties, which were very much relevant back in the late 60s and early 70s.
CPUs have instructions for conditional jump on test for 0. In fact, some CPUs even have instructions which will iterate/copy a sequence of bytes up to the 0.
If you used an escaped character instead, you would have to test two different bytes to detect the end of the string. Not only is that slower, but you also lose the ability to iterate one byte at a time, because you need a look-ahead or the ability to backtrack.
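To make the difference concrete, here is a minimal sketch of both scans. The two-byte `\$` end marker is the hypothetical scheme from the question, and the function names are made up for illustration; this is not how any real C library works.

```c
#include <stddef.h>

/* Classic C scan: one comparison against 0 per byte. */
size_t len_nul(const char *s) {
    size_t i = 0;
    while (s[i] != '\0') {
        i++;
    }
    return i;
}

/* Hypothetical scheme: the string ends at the two-byte sequence
   backslash + '$'. Every backslash forces a look-ahead at the next
   byte before we know whether the string has ended. The caller must
   guarantee that the marker is present in the buffer. */
size_t len_escaped(const char *s) {
    size_t i = 0;
    for (;;) {
        if (s[i] == '\\' && s[i + 1] == '$') {
            return i;
        }
        i++;
    }
}
```

With this scheme a 0 byte becomes ordinary data, but the byte pair `\$` itself can no longer appear inside a string, so the original problem has only moved.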
Now, other languages (cough, Pascal, cough) use strings in a count/value style. For them, any character is valid, but they always keep a counter with the size of the string. The advantage is clear, but there are disadvantages to this technique too.
For one thing, the string size is limited by the number of bytes the count takes. One byte gives you 255 characters, two bytes give you 65535, etc. It might be almost irrelevant today, but adding two bytes to every string was once quite expensive.
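A counted string with a one-byte length might look roughly like the sketch below. The `pstr` type and `pstr_set` function are hypothetical names chosen for this example, not part of Pascal or of any C library.

```c
#include <stddef.h>
#include <string.h>

#define PSTR_MAX 255  /* largest value a one-byte count can hold */

typedef struct {
    unsigned char len;       /* one-byte length prefix */
    char data[PSTR_MAX];     /* payload; any byte value is allowed, including 0 */
} pstr;

/* Copies n bytes into dst. Returns 0 on success, or -1 if n does not
   fit in the one-byte count. */
int pstr_set(pstr *dst, const void *src, size_t n) {
    if (n > PSTR_MAX) {
        return -1;
    }
    dst->len = (unsigned char)n;
    memcpy(dst->data, src, n);
    return 0;
}
```

The length is known without scanning and embedded zeros are harmless, but anything longer than 255 bytes is rejected, which is the size limit described above.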
Edit:
I do not think the question is dumb. In these days of high level languages with memory management, incredible CPU power and obscene amounts of memory, such decisions from the past can well seem senseless. And, indeed, they MIGHT be senseless nowadays, so it's a fine thing to question them.
You need to have some actual byte value to terminate a string - how you represent it in code isn't really relevant.
If you used \$ to terminate strings, what byte value would it have in memory? How would you include that byte value in a string? You're going to hit this problem whatever you do, if you use a special character to terminate strings. The alternative is to use counted strings, whereby the representation of a string includes its length (e.g. BSTR).
I guess because it's faster to check, and totally improbable to occur in a reasonable string.
Also, remember that C has no concept of strings. A string in C is not something by itself. It's just an array of characters. The fact that it's called and used as a string is purely incidental and conventional.
It causes problems but you can embed a \0 ...
It causes a problem if you pass this to a standard library function like strlen, but not otherwise.
A better solution than any string-terminating character might be to prepend the length of the string like ...
... which is the way some other languages do it.
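The answer's own snippets did not survive on this page. As a stand-in, here is a minimal sketch of the embedding case; the array contents and function names are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* A char array with a \0 in the middle. The compiler stores all of it,
   plus the implicit terminating \0 at the end of the literal. */
static const char greeting[] = "Hello\0World";

/* strlen stops at the first \0, so it only sees "Hello". */
size_t visible_length(void) {
    return strlen(greeting);
}

/* sizeof reports the full storage: 5 + 1 + 5 + 1 bytes. */
size_t stored_size(void) {
    return sizeof greeting;
}
```

The bytes after the first \0 are still in memory; only functions that rely on the terminator convention fail to see them.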
If standard library functions like strlen or printf could (option-wise) look for an end-of-string marker \777 (as an alternative to \000), you could have a constant character string containing \0s:
By the way, if you want to send a \0 to stdout (aka -print0) you may use:
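The original snippet is missing here. One way to do it, assuming plain standard I/O, is to write the bytes with an explicit length instead of going through a format string. The function name below is hypothetical.

```c
#include <stdio.h>

/* Writes len bytes of data followed by a single \0 byte, similar in
   spirit to the output of find -print0. Returns the number of bytes
   actually written, including the trailing \0. */
size_t write_nul_terminated(FILE *out, const char *data, size_t len) {
    size_t written = fwrite(data, 1, len, out);
    if (written == len && fputc('\0', out) != EOF) {
        written++;
    }
    return written;
}
```

For example, `write_nul_terminated(stdout, "file.txt", 8)` emits nine bytes. fwrite and fputc work here because neither interprets the data as a terminated string.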
Ditto on the historical reasons.
The creators of std::string in C++ recognized this shortcoming, so std::string can include the null character. (But be careful constructing a std::string with a null character!)
If you want to have a C-string (or rather, a quasi-C-string) with a null character, you will have to make your own struct.
Or you'll have to keep track of the string length in some other way and pass it to every string function that you write.
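Such a struct could look something like this sketch. The `byte_str` type and the comparison function are illustrative names, not an established API.

```c
#include <stddef.h>
#include <string.h>

/* A quasi-string: a pointer plus an explicit length, so the data may
   contain \0 bytes anywhere. */
typedef struct {
    const char *data;
    size_t len;
} byte_str;

/* Every operation has to use the stored length instead of searching
   for a terminator. Returns 1 if equal, 0 otherwise. */
int byte_str_equal(byte_str a, byte_str b) {
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

Two values such as `{"ab\0cd", 5}` and `{"ab\0ce", 5}` compare as different here, whereas strcmp on the same pointers would stop at the embedded \0 and call them equal.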
Not to necro-post deliberately, but this is still highly relevant for embedded SQL.
If you are dealing with binary data in C, you should be creating a binary object in a data structure. If you can afford it, an array of char will suffice. It probably isn't a string anyway, is it?
For hash / digest values, it is common to "HEX" them out into members of {'0',..,'F'}.
These can then be "UNHEXED" during the database operation.
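As a rough sketch of the "HEX" step, the function below turns raw digest bytes into uppercase hex text that is safe to handle as an ordinary C string. The function name and buffer convention are assumptions for this example.

```c
#include <stddef.h>

/* Encodes n bytes from in as 2*n uppercase hex digits in out, followed
   by a terminating \0. The caller must provide at least 2*n + 1 bytes
   of space in out. */
void hex_encode(const unsigned char *in, size_t n, char *out) {
    static const char digits[] = "0123456789ABCDEF";
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = digits[in[i] >> 4];    /* high nibble */
        out[2 * i + 1] = digits[in[i] & 0x0F];  /* low nibble */
    }
    out[2 * n] = '\0';
}
```

Because the output only contains the characters 0-9 and A-F, a zero byte in the digest can never be mistaken for the end of the string.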
For file operations, consider a binary stream, with a logical record length.
Escaping them yourself is only really safe if you can guarantee the encoding. In fact this can be seen in a MYSQLDUMP (SQL) unload, where the binaries are properly escaped for, say, UTF-8, and the installation scheme is 'pushed' for the load and 'popped' afterwards.
I don't advocate using a dbms call for what should be a library function either, but I have seen it done. (select of real_escape_string ($string)).
And there's base64, which is another can of worms. Google UUENCODE.
So yeah, use the mem* functions if your characters are fixed width.
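For instance, memchr takes an explicit length, so it keeps working on buffers that contain \0 bytes. The helper below is a hypothetical example that counts how often a byte occurs in such a buffer.

```c
#include <stddef.h>
#include <string.h>

/* Counts occurrences of byte in the first len bytes of buf. Unlike
   strchr, memchr does not stop at a \0, so embedded zeros are treated
   like any other byte. */
size_t count_byte(const void *buf, size_t len, int byte) {
    const unsigned char *p = buf;
    const unsigned char *end = p + len;
    size_t count = 0;
    while (p < end && (p = memchr(p, byte, (size_t)(end - p))) != NULL) {
        count++;
        p++;  /* continue searching after the match */
    }
    return count;
}
```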
There is no reason for a nul character to be part of a string except as a terminator; it has no graphical representation, so you wouldn't see it, nor does it act as a control character. As far as text is concerned, it's as out-of-band a value as you can get without using a different representation (e.g., a multibyte value like 0xFFFF).
To slightly rephrase Michael's question, how would you expect "Hello\0World\0" to be handled?