Matching words in ANSI C

Posted 2024-12-21 12:56:40

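How can I match a word (1 to n characters) in ANSI C? (Also: is there a pattern that matches constants in C source code?)

I tried reading the file and passing it to regexec() (regex.h). The problem: the tool I'm writing should be able to read source code and find all the constants (#define) that are used, to check whether they are defined.

The pattern I used for testing is [a-zA-Z_0-9]{1,}. But this also matches words such as the "h" in "test.h".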


Smile简单爱 2024-12-28 12:56:40

Identifiers must start with a letter or underscore, so the pattern is

[A-Za-z_][A-Za-z0-9_]*

I know of no syntactic difference between C and preprocessor identifiers. There is a convention to use upper case for preprocessor and lowercase for C identifiers, but no actual requirement. Unless defines are guaranteed to use a distinct naming convention you would basically have to find every identifier in the source file and any included files and sort them into preprocessor identifiers, C identifiers and undeclared identifiers.
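As a rough illustration of applying that pattern with regexec(), here is a minimal sketch that walks a buffer and copies out every identifier it finds. The helper name find_identifiers and the ID_MAX cap are hypothetical, not part of regex.h. It also skips matches that directly follow a digit, on the assumption that those are tails of numbers such as 0x1F rather than identifiers.

```c
#include <ctype.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define ID_MAX 64  /* hypothetical cap on stored identifier length */

/* Hypothetical helper: stores up to max_ids identifiers from text in out.
   Returns the number stored, or -1 if the pattern fails to compile. */
int find_identifiers(const char *text, char out[][ID_MAX], int max_ids)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int n = 0;

    if (regcomp(&re, "[A-Za-z_][A-Za-z0-9_]*", REG_EXTENDED) != 0)
        return -1;

    /* Each match is at least one character long, so p always advances. */
    while (n < max_ids && regexec(&re, p, 1, &m, 0) == 0) {
        const char *start = p + m.rm_so;
        size_t len = (size_t)(m.rm_eo - m.rm_so);

        p += m.rm_eo;

        /* Assumption: a match glued to a preceding digit is the tail of
           a number (0x1F, 1e3), not an identifier, so skip it. */
        if (start > text && isdigit((unsigned char)start[-1]))
            continue;

        if (len >= ID_MAX)
            len = ID_MAX - 1;  /* truncate overly long names */
        memcpy(out[n], start, len);
        out[n][len] = '\0';
        n++;
    }

    regfree(&re);
    return n;
}
```

Note that on input such as #include "test.h" this still reports test and h as two separate identifiers, because the dot is not an identifier character. The pattern itself is fine; avoiding that requires skipping string literals and header names before matching.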

From the GCC manual:

Preprocessing tokens fall into five broad classes: identifiers, preprocessing numbers, string literals, punctuators, and other. An identifier is the same as an identifier in C: any sequence of letters, digits, or underscores, which begins with a letter or underscore. Keywords of C have no significance to the preprocessor; they are ordinary identifiers. You can define a macro whose name is a keyword, for instance. The only identifier which can be considered a preprocessing keyword is defined.
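Since macro names are ordinary identifiers, one way to tell them apart is to collect the names that appear in #define directives and compare other identifiers against that list. The following is a sketch only: find_defines and MACRO_MAX are hypothetical names, and the pattern assumes each directive starts on its own line (it does not understand comments or line continuations).

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MACRO_MAX 64  /* hypothetical cap on stored macro-name length */

/* Hypothetical helper: stores the names introduced by #define lines.
   Returns the number stored, or -1 if the pattern fails to compile. */
int find_defines(const char *text, char out[][MACRO_MAX], int max_names)
{
    /* Start of line, optional blanks, '#', optional blanks, "define",
       blanks, then the macro name captured as group 1. */
    static const char pattern[] =
        "^[ \t]*#[ \t]*define[ \t]+([A-Za-z_][A-Za-z0-9_]*)";
    regex_t re;
    regmatch_t m[2];
    const char *p = text;
    int n = 0;
    int eflags = 0;

    /* REG_NEWLINE makes '^' match after every newline in the buffer. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NEWLINE) != 0)
        return -1;

    while (n < max_names && regexec(&re, p, 2, m, eflags) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);

        if (len >= MACRO_MAX)
            len = MACRO_MAX - 1;
        memcpy(out[n], p + m[1].rm_so, len);
        out[n][len] = '\0';
        n++;

        p += m[0].rm_eo;
        /* After the first match, p no longer points at a line start. */
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return n;
}
```

An identifier found elsewhere in the source can then be looked up in this list to decide whether it was defined as a macro in the scanned text.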


黯然 2024-12-28 12:56:40

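Instead of running a regex search over the C source, another option is to use a preprocessor library such as Boost Wave, or a tool like Coan, rather than starting from scratch.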


不羁少年 2024-12-28 12:56:40

Here are the lexer grammar and the parser grammar (in flex and bison format, respectively) for the entire C language. In particular, the part relevant to identifiers is:

D           [0-9]
L           [a-zA-Z_]
{L}({L}|{D})*       { count(); return(check_type()); }

So the id can start with any uppercase or lowercase letter or an underscore, and then have more uppercase or lowercase letters, underscores, and numbers. I believe it doesn't match parts of file names because they're quoted and it handles quotes separately.
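The same idea can be sketched without flex as a small hand-written scanner in C. This is an illustration under stated assumptions, not the linked grammar: scan_identifiers and NAME_LEN are hypothetical names, only /* */ comments are recognized (as in ANSI C), and header names in angle brackets are not treated specially.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 64  /* hypothetical cap on stored identifier length */

/* Hypothetical helper: stores identifiers found outside string literals,
   character literals and comments. Returns the number stored. */
int scan_identifiers(const char *s, char out[][NAME_LEN], int max_names)
{
    int n = 0;

    while (*s && n < max_names) {
        unsigned char c = (unsigned char)*s;

        if (c == '"' || c == '\'') {
            /* Quoted text: skip to the closing quote, honoring escapes. */
            char quote = *s++;
            while (*s && *s != quote) {
                if (*s == '\\' && s[1])
                    s++;
                s++;
            }
            if (*s)
                s++;
        } else if (s[0] == '/' && s[1] == '*') {
            /* Comment: skip to the closing marker. */
            s += 2;
            while (*s && !(s[0] == '*' && s[1] == '/'))
                s++;
            if (*s)
                s += 2;
        } else if (isalpha(c) || c == '_') {
            /* Identifier: {L}({L}|{D})* in the grammar's notation. */
            const char *start = s;
            size_t len;
            while (isalnum((unsigned char)*s) || *s == '_')
                s++;
            len = (size_t)(s - start);
            if (len >= NAME_LEN)
                len = NAME_LEN - 1;
            memcpy(out[n], start, len);
            out[n][len] = '\0';
            n++;
        } else if (isdigit(c)) {
            /* Number: consume it whole so 0x1F does not yield x1F. */
            while (isalnum((unsigned char)*s) || *s == '.' || *s == '_')
                s++;
        } else {
            s++;
        }
    }
    return n;
}
```

With this approach the text inside "test.h" is consumed as part of a string literal, so neither test nor h is reported. An #include <test.h> line would still produce both, and would need an extra rule.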
