Lexer/tokenizer and character sets

Posted on 2024-08-20 16:21:02

When constructing a lexer/tokenizer, is it a mistake to rely on C functions such as isdigit/isalpha/...? As far as I know, they depend on the locale. Should I instead pick a character set, concentrate on it, and build my own character mapping from which I look up classifications? The problem then becomes being able to lex multiple character sets: do I produce one lexer/tokenizer per character set, or do I try to write a single one so that the only thing I have to change is the character mapping? What are common practices?
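The "character mapping" idea in the question can be sketched as a 256-entry lookup table. All names below (cc_table, CC_DIGIT, lex_class) are illustrative, not from any library:

```c
#include <stdint.h>

/* Hypothetical character-class flags for a lookup-table lexer. */
enum {
    CC_DIGIT = 1 << 0,
    CC_ALPHA = 1 << 1,
    CC_SPACE = 1 << 2,
};

static uint8_t cc_table[256];

/* Fill the table once for a chosen character set (here: plain ASCII). */
static void cc_init(void) {
    for (int c = '0'; c <= '9'; c++) cc_table[c] |= CC_DIGIT;
    for (int c = 'a'; c <= 'z'; c++) cc_table[c] |= CC_ALPHA;
    for (int c = 'A'; c <= 'Z'; c++) cc_table[c] |= CC_ALPHA;
    cc_table[' ']  |= CC_SPACE;
    cc_table['\t'] |= CC_SPACE;
    cc_table['\n'] |= CC_SPACE;
}

/* Classification is a single array lookup, independent of locale. */
static int lex_class(unsigned char c) { return cc_table[c]; }
```

Switching character sets then means only re-filling the table; the lexer itself never changes.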

Comments (4)

生死何惧 2024-08-27 16:21:02

For now, I would concentrate on getting the lexer working first with the plain ASCII character set; then, once the lexer works, add mapping support for different character types such as UTF-16, and locale support.

And no, it is not a mistake to rely on the ctype functions such as isdigit, isalpha and so on...

Actually, for a later stage, there is a POSIX equivalent of ctype for wide characters, 'wctype.h', so it might be in your best interest to define a macro later on... that way you will be able to transparently change the code to handle different locale sets...

#ifdef LEX_WIDECHARS
#include <wctype.h>
#define lex_isdigit(c) iswdigit(c)
#else
#include <ctype.h>
#define lex_isdigit(c) isdigit((unsigned char)(c))
#endif

It would be defined something like the above in that context...
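For illustration, here is a self-contained version of that macro idea (lex_isdigit and LEX_WIDECHARS are made-up names, not standard ones), together with a tiny scanner that uses it:

```c
#include <ctype.h>
#include <stddef.h>

#ifdef LEX_WIDECHARS
#include <wctype.h>
#define lex_isdigit(c) iswdigit(c)
#else
/* Narrow-char fallback: cast to unsigned char to avoid passing a
 * negative value to isdigit(), which is undefined behavior. */
#define lex_isdigit(c) isdigit((unsigned char)(c))
#endif

/* Return the length of the digit run at the start of s. */
static size_t scan_int(const char *s) {
    size_t n = 0;
    while (lex_isdigit(s[n]))
        n++;
    return n;
}
```

The lexer calls only lex_isdigit; switching to wide characters then becomes a compile-time flag rather than an edit of every call site.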

Hope this helps,
Best regards,
Tom.

迷雾森÷林ヴ 2024-08-27 16:21:02

The ctype.h functions are not very usable for chars that contain anything but ASCII. The default locale is C (essentially the same as ASCII on most machines), no matter what the system locale is. Even if you use setlocale to change the locale, chances are that the system uses a character set wider than 8 bits (e.g. UTF-8), in which case you cannot tell anything useful from a single char.
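A quick way to see this: the helper below (alpha_in_locale is a hypothetical name) asks what isalpha() says about a byte under a named locale, and in the default "C" locale the ISO-8859-1 byte for 'é' (0xE9) is not alphabetic:

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Report whether isalpha() accepts byte c under locale `loc`,
 * restoring the previous LC_CTYPE afterwards.  Which named locales
 * exist is system-dependent; "C" is the only one guaranteed. */
static int alpha_in_locale(unsigned char c, const char *loc) {
    char old[64];
    const char *cur = setlocale(LC_CTYPE, NULL);
    snprintf(old, sizeof old, "%s", cur ? cur : "C");
    if (!setlocale(LC_CTYPE, loc))
        return -1;                 /* locale not installed */
    int r = isalpha(c) != 0;
    setlocale(LC_CTYPE, old);
    return r;
}
```

Whether any locale other than "C" is available at all depends entirely on the system configuration, which is exactly why relying on it is fragile.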

Wide chars handle more cases properly, but even they fail too often.

So, if you want to support non-ASCII character classification (isspace and the like) reliably, you have to do it yourself (or possibly use an existing library).

Note: ASCII only has character codes 0-127 (or 32-127 for the printable range); what some call 8-bit ASCII is actually some other character set (commonly CP437, CP1252 or ISO-8859-1, and often something else entirely).

你的笑 2024-08-27 16:21:02

You are not likely to get very far trying to build a locale-sensitive parser -- it will drive you mad. ASCII works fine for most parsing needs -- don't fight it :D

If you do want to fight it and use some character classifications, you should look at the ICU library, which implements Unicode religiously.

万人眼中万个我 2024-08-27 16:21:02

Generally you need to ask yourself:

  • What exactly do you want to do -- what kind of parsing?
  • What languages do you want to support: a wide range, or Western European only?
  • What encoding do you want to use: UTF-8 or a localized 8-bit encoding?
  • What OS are you using?

Let's start: if you work with Western languages in a localized 8-bit encoding, then probably yes, you may rely on is*, provided the locales are installed and configured.
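One caveat even in the cases where is* is fine: plain char may be signed, and passing a negative value other than EOF to these functions is undefined behavior, so the byte should go through unsigned char first (safe_isalpha is an illustrative name):

```c
#include <ctype.h>

/* Wrap isalpha() so that 8-bit bytes stored in a (possibly signed)
 * char are passed as the 0-255 value ctype expects. */
static int safe_isalpha(char c) {
    return isalpha((unsigned char)c) != 0;
}
```

The same wrapper pattern applies to isdigit, isspace and the rest of the family.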

However:

  • If you work with UTF-8, you can't, because only ASCII would be covered: everything outside ASCII takes more than one byte.
  • If you want to support Eastern languages, all your assumptions about parsing would be wrong; Chinese, for example, does not use spaces to separate words. Many languages do not even have upper and lower case -- even alphabet-based ones like Hebrew or Arabic.
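The first point can be checked directly: in UTF-8, 'é' is the two bytes 0xC3 0xA9, and neither byte looks like a letter to byte-at-a-time isalpha() in the default "C" locale (utf8_bytes_alpha is an illustrative helper):

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if any single byte of s classifies as alphabetic. */
static int utf8_bytes_alpha(const char *s) {
    int any = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        any |= isalpha((unsigned char)s[i]) != 0;
    return any;
}
```

So a byte-oriented lexer sees a two-byte "non-letter" where a human sees one letter, which is why UTF-8 needs decoding before classification.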

So, what exactly do you want to do?

I'd suggest taking a look at the ICU library, which has various break iterators, or at other toolkits like Qt that provide some basic boundary analysis.
