Lexer/tokenizer and character sets

Posted on 2024-08-20 16:21:02

When constructing a lexer/tokenizer, is it a mistake to rely on C functions such as isdigit/isalpha/...? As far as I know, they depend on the locale. Should I instead pick a character set, concentrate on it, and build my own character mapping from which I look up classifications? The problem then becomes being able to lex multiple character sets: do I produce one lexer/tokenizer per character set, or do I try to write a single one so that the only thing I have to change is the character mapping? What are common practices?
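The "character mapping" idea in the question can be sketched as a 256-entry lookup table. All names below (cc_table, CC_DIGIT, lex_class) are illustrative, not from any library:

```c
#include <stdint.h>

/* Hypothetical character-class flags for a lookup-table lexer. */
enum {
    CC_DIGIT = 1 << 0,
    CC_ALPHA = 1 << 1,
    CC_SPACE = 1 << 2,
};

static uint8_t cc_table[256];

/* Fill the table once for a chosen character set (here: plain ASCII). */
static void cc_init(void) {
    for (int c = '0'; c <= '9'; c++) cc_table[c] |= CC_DIGIT;
    for (int c = 'a'; c <= 'z'; c++) cc_table[c] |= CC_ALPHA;
    for (int c = 'A'; c <= 'Z'; c++) cc_table[c] |= CC_ALPHA;
    cc_table[' ']  |= CC_SPACE;
    cc_table['\t'] |= CC_SPACE;
    cc_table['\n'] |= CC_SPACE;
}

/* Classification is a single array lookup, independent of locale. */
static int lex_class(unsigned char c) { return cc_table[c]; }
```

Switching character sets then means only re-filling the table; the lexer itself never changes.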

Comments (4)

生死何惧 2024-08-27 16:21:02

For now, I would concentrate on getting the lexer working first with the plain ASCII character set; then, once the lexer works, add mapping support for different character types such as UTF-16, and locale support.

And no, it is not a mistake to rely on the ctype functions such as isdigit, isalpha and so on...

Actually, for a later stage, there is a POSIX equivalent of ctype for wide characters, 'wctype.h', so it might be in your best interest to define a macro later on... that way you will be able to transparently change the code to handle different locale sets...

#ifdef LEX_WIDECHARS
#include <wctype.h>
#define lex_isdigit(c) iswdigit(c)
#else
#include <ctype.h>
#define lex_isdigit(c) isdigit((unsigned char)(c))
#endif

It would be defined something like the above in that context...
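For illustration, here is a self-contained version of that macro idea (lex_isdigit and LEX_WIDECHARS are made-up names, not standard ones), together with a tiny scanner that uses it:

```c
#include <ctype.h>
#include <stddef.h>

#ifdef LEX_WIDECHARS
#include <wctype.h>
#define lex_isdigit(c) iswdigit(c)
#else
/* Narrow-char fallback: cast to unsigned char to avoid passing a
 * negative value to isdigit(), which is undefined behavior. */
#define lex_isdigit(c) isdigit((unsigned char)(c))
#endif

/* Return the length of the digit run at the start of s. */
static size_t scan_int(const char *s) {
    size_t n = 0;
    while (lex_isdigit(s[n]))
        n++;
    return n;
}
```

The lexer calls only lex_isdigit; switching to wide characters then becomes a compile-time flag rather than an edit of every call site.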

Hope this helps,
Best regards,
Tom.

迷雾森÷林ヴ 2024-08-27 16:21:02

The ctype.h functions are not very usable for chars that contain anything but ASCII. The default locale is C (essentially the same as ASCII on most machines), no matter what the system locale is. Even if you use setlocale to change the locale, chances are that the system uses a character set wider than 8 bits (e.g. UTF-8), in which case you cannot tell anything useful from a single char.
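A quick way to see this: the helper below (alpha_in_locale is a hypothetical name) asks what isalpha() says about a byte under a named locale, and in the default "C" locale the ISO-8859-1 byte for 'é' (0xE9) is not alphabetic:

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Report whether isalpha() accepts byte c under locale `loc`,
 * restoring the previous LC_CTYPE afterwards.  Which named locales
 * exist is system-dependent; "C" is the only one guaranteed. */
static int alpha_in_locale(unsigned char c, const char *loc) {
    char old[64];
    const char *cur = setlocale(LC_CTYPE, NULL);
    snprintf(old, sizeof old, "%s", cur ? cur : "C");
    if (!setlocale(LC_CTYPE, loc))
        return -1;                 /* locale not installed */
    int r = isalpha(c) != 0;
    setlocale(LC_CTYPE, old);
    return r;
}
```

Whether any locale other than "C" is available at all depends entirely on the system configuration, which is exactly why relying on it is fragile.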

Wide chars handle more cases properly, but even they fail too often.

So, if you want to support non-ASCII character classification (isspace and the like) reliably, you have to do it yourself (or possibly use an existing library).

Note: ASCII only has character codes 0-127 (or 32-127 for the printable range); what some call 8-bit ASCII is actually some other character set (commonly CP437, CP1252 or ISO-8859-1, and often something else entirely).

你的笑 2024-08-27 16:21:02

You are not likely to get very far trying to build a locale-sensitive parser -- it will drive you mad. ASCII works fine for most parsing needs -- don't fight it :D

If you do want to fight it and use some character classifications, you should look at the ICU library, which implements Unicode religiously.

万人眼中万个我 2024-08-27 16:21:02

Generally you need to ask yourself:

  • What exactly do you want to do -- what kind of parsing?
  • What languages do you want to support: a wide range, or Western European only?
  • What encoding do you want to use: UTF-8 or a localized 8-bit encoding?
  • What OS are you using?

Let's start: if you work with Western languages in a localized 8-bit encoding, then probably yes, you may rely on is*, provided the locales are installed and configured.
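One caveat even in the cases where is* is fine: plain char may be signed, and passing a negative value other than EOF to these functions is undefined behavior, so the byte should go through unsigned char first (safe_isalpha is an illustrative name):

```c
#include <ctype.h>

/* Wrap isalpha() so that 8-bit bytes stored in a (possibly signed)
 * char are passed as the 0-255 value ctype expects. */
static int safe_isalpha(char c) {
    return isalpha((unsigned char)c) != 0;
}
```

The same wrapper pattern applies to isdigit, isspace and the rest of the family.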

However:

  • If you work with UTF-8, you can't, because only ASCII would be covered: everything outside ASCII takes more than one byte.
  • If you want to support Eastern languages, all your assumptions about parsing would be wrong; Chinese, for example, does not use spaces to separate words. Many languages do not even have upper and lower case -- even alphabet-based ones like Hebrew or Arabic.
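The first point can be checked directly: in UTF-8, 'é' is the two bytes 0xC3 0xA9, and neither byte looks like a letter to byte-at-a-time isalpha() in the default "C" locale (utf8_bytes_alpha is an illustrative helper):

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if any single byte of s classifies as alphabetic. */
static int utf8_bytes_alpha(const char *s) {
    int any = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        any |= isalpha((unsigned char)s[i]) != 0;
    return any;
}
```

So a byte-oriented lexer sees a two-byte "non-letter" where a human sees one letter, which is why UTF-8 needs decoding before classification.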

So, what exactly do you want to do?

I'd suggest taking a look at the ICU library, which has various break iterators, or at other toolkits like Qt that provide some basic boundary analysis.
