Implementation of string literal concatenation in C and C++

Posted 2024-09-07 21:16:02


AFAIK, this question applies equally to C and C++.

Step 6 of the "translation phases" specified in the C standard (5.1.1.2 in the draft C99 standard) states that adjacent string literals have to be concatenated into a single literal. I.e.

printf("helloworld.c" ": %d: Hello "
       "world\n", 10);

is equivalent (syntactically) to:

printf("helloworld.c: %d: Hello world\n", 10);
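The equivalence can be checked directly. The following is a minimal sketch, assuming any conforming C compiler; the names `split_fmt`, `whole_fmt`, and `same_format` are made up for illustration.

```c
#include <string.h>

/* Two spellings of the same format string. The first relies on
   adjacent string literals being joined during translation. */
static const char split_fmt[] = "helloworld.c" ": %d: Hello "
                                "world\n";
static const char whole_fmt[] = "helloworld.c: %d: Hello world\n";

/* Returns 1 when both arrays have the same size and the same contents. */
int same_format(void)
{
    return sizeof split_fmt == sizeof whole_fmt
        && strcmp(split_fmt, whole_fmt) == 0;
}
```

Because the joining happens during translation, both arrays are ordinary initialized arrays by the time the program runs.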

However, the standard doesn't seem to specify which part of the implementation has to handle this: should it be the preprocessor (cpp) or the compiler itself? Some online research tells me that this function is generally expected to be performed by the preprocessor (source #1, source #2, and there are more), which makes sense.

However, running cpp on Linux shows that cpp doesn't do it:

eliben@eliben-desktop:~/test$ cat cpptest.c 
int a = 5;

"string 1" "string 2"
"string 3"

eliben@eliben-desktop:~/test$ cpp cpptest.c 
# 1 "cpptest.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cpptest.c"
int a = 5;

"string 1" "string 2"
"string 3"

So, my question is: where should this feature of the language be handled, in the preprocessor or the compiler itself?

Perhaps there's no single good answer. Heuristic answers based on experience, known compilers, and general good engineering practice will be appreciated.

P.S. If you're wondering why I care about this... I'm trying to figure out whether my Python-based C parser should handle string literal concatenation (which it doesn't do at the moment), or leave it to cpp, which it assumes runs before it.


℉服软 2024-09-14 21:16:02

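The standard doesn't specify a preprocessor versus a compiler; it only specifies the phases of translation you already noted. Traditionally, phases 1 through 4 were in the preprocessor, phases 5 through 7 in the compiler, and phase 8 in the linker, but none of that is required by the standard.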


忆伤 2024-09-14 21:16:02

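Unless the preprocessor is specified to handle this, it's safe to assume it's the compiler's job.

Edit:

The "I.e." link at the beginning of the post answers the question:

Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time...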
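As an illustration of the macro case in that quote, here is a small sketch. The macros `APP_NAME`, `APP_MAJOR`, `STR`, and `STR_` and the function `banner_is_joined` are hypothetical and exist only for this example.

```c
#include <string.h>

/* Hypothetical macros, for illustration only. */
#define APP_NAME  "demo"
#define APP_MAJOR 3
#define STR_(x)   #x        /* stringify the argument as written */
#define STR(x)    STR_(x)   /* expand the argument first, then stringify */

/* After macro expansion this is three adjacent literals,
   "demo" " v" "3", which are then joined into "demo v3". */
static const char banner[] = APP_NAME " v" STR(APP_MAJOR);

/* Returns 1 when the joined literal has the expected contents. */
int banner_is_joined(void)
{
    return strcmp(banner, "demo v3") == 0;
}
```

Macro expansion only places the literals next to each other; the joining itself is a separate, later step.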


没企图 2024-09-14 21:16:02

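In the ANSI C standard, this detail is covered in section 5.1.1.2, item (6):

5.1.1.2 Translation phases
...
4. Preprocessing directives are executed and macro invocations are expanded. ...
5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.
6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.

The standard does not require that an implementation be split into a preprocessor and a compiler as such.

Step 4 is clearly a preprocessor responsibility.

Step 5 requires that the "execution character set" be known. The compiler needs this information as well. It is easier to port the compiler to a new platform if the preprocessor contains no platform dependencies, so the tendency is to implement step 5, and therefore step 6, in the compiler.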
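The ordering of steps 5 and 6 can be sketched as "decode each literal, then join the decoded pieces". The code below is a simplified illustration, not how any particular compiler does it: `decode_body` and `decode_and_join` are hypothetical names, only octal and a few simple escapes are handled, and the execution character set is assumed to be the same as the host's.

```c
#include <stddef.h>

/* Step 5 for one literal body (the text between the quotes): copy it to
   out, turning octal escapes and a few simple escapes into single
   characters. Hex escapes, universal character names, and error reporting
   are left out. Returns the number of characters written. */
static size_t decode_body(const char *body, char *out)
{
    size_t n = 0;
    while (*body != '\0') {
        if (*body != '\\') {             /* ordinary character */
            out[n++] = *body++;
            continue;
        }
        body++;                          /* skip the backslash */
        if (*body == '\0')
            break;                       /* stray trailing backslash */
        if (*body >= '0' && *body <= '7') {
            int value = 0;               /* up to three octal digits */
            for (int d = 0; d < 3 && *body >= '0' && *body <= '7'; d++)
                value = value * 8 + (*body++ - '0');
            out[n++] = (char)value;
            continue;
        }
        switch (*body) {
        case 'n': out[n++] = '\n'; break;
        case 'r': out[n++] = '\r'; break;
        case 't': out[n++] = '\t'; break;
        default:  out[n++] = *body; break;   /* \\ \" \' and the rest */
        }
        body++;
    }
    return n;
}

/* Step 6: join the pieces, each decoded on its own first. The caller must
   provide a buffer large enough for all pieces plus the terminator. */
size_t decode_and_join(const char *const bodies[], size_t count, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++)
        n += decode_body(bodies[i], out + n);
    out[n] = '\0';
    return n;
}
```

Decoding per piece means an escape at the end of one literal can never absorb characters from the start of the next one.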


当梦初醒 2024-09-14 21:16:02

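I would handle it in the token-scanning part of the parser, so in the compiler. That seems more logical. The preprocessor does not have to know the "structure" of the language, and in fact it usually ignores it, which is why macros can generate code that does not compile. It handles nothing more than what it is entitled to handle through directives addressed specifically to it (# ...) and their "consequences" (such as those of a #define x h, which makes the preprocessor change a lot of x into h).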
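One way such a scanner-level pass might look is sketched below. The token layout (`struct token`, `enum tok_kind`) and the function `merge_adjacent_strings` are assumptions made for this example, not taken from any real parser. The sketch also assumes each string token already holds its decoded contents with no embedded '\0'.

```c
#include <stddef.h>
#include <string.h>

enum tok_kind { TOK_IDENT, TOK_PUNCT, TOK_STRING };

struct token {
    enum tok_kind kind;
    char text[64];   /* for TOK_STRING: decoded contents, without quotes */
};

/* Collapses each run of adjacent TOK_STRING tokens into a single token,
   in place, and returns the new token count. Text that does not fit in
   the fixed-size buffer is truncated. */
size_t merge_adjacent_strings(struct token *toks, size_t count)
{
    size_t out = 0;   /* number of tokens kept so far */
    for (size_t i = 0; i < count; i++) {
        if (out > 0 && toks[i].kind == TOK_STRING
                    && toks[out - 1].kind == TOK_STRING) {
            /* Append to the previous string token instead of keeping it. */
            size_t used = strlen(toks[out - 1].text);
            size_t room = sizeof toks[out - 1].text - 1 - used;
            strncat(toks[out - 1].text, toks[i].text, room);
        } else {
            if (out != i)
                toks[out] = toks[i];
            out++;
        }
    }
    return out;
}
```

Running this pass between the scanner and the parser keeps the grammar simple: the parser only ever sees one string token where the source had several adjacent ones.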


策马西风 2024-09-14 21:16:02


There are tricky rules for how string literal concatenation interacts with escape sequences.
Suppose you have

const char x1[] = "a\15" "4";
const char y1[] = "a\154";
const char x2[] = "a\r4";
const char y2[] = "al";

then x1 and x2 must wind up equal according to strcmp, and the same for y1 and y2. (This is what Heath is getting at in quoting the translation steps - escape conversion happens before string constant concatenation.) There's also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together and it winds up being significantly more convenient to do this work as part of the "compiler" rather than the "preprocessor."
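Both requirements can be turned into a small self-check. This is a sketch that assumes an ASCII-based execution character set (so that octal 15 is '\r' and octal 154 is 'l') and a C99-or-later compiler for the mixed-prefix case; the function names are made up for illustration.

```c
#include <string.h>
#include <wchar.h>

static const char x1[] = "a\15" "4";   /* escape ends at the closing quote */
static const char y1[] = "a\154";      /* one three-digit octal escape */
static const char x2[] = "a\r4";
static const char y2[] = "al";

/* A narrow literal next to a wide one yields a wide string. */
static const wchar_t mixed[] = L"wide " "narrow";

/* Returns 1 when escapes were converted before the literals were joined. */
int escapes_before_joining(void)
{
    return strcmp(x1, x2) == 0 && strcmp(y1, y2) == 0
        && sizeof x1 == 4 && sizeof y1 == 3;
}

/* Returns 1 when the mixed group became one wide string. */
int mixed_group_is_wide(void)
{
    return sizeof mixed / sizeof mixed[0] == 12
        && wcscmp(mixed, L"wide narrow") == 0;
}
```

A tool that pasted the raw spellings together before decoding escapes would turn x1 into "a\154" and fail the first check.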