Problems updating some scanner code to use ICU
I am working on a rudimentary hand-coded lexical scanner and wish to support UTF-8 input (it's not 1970 anymore!). Input characters are read from `stdin` or a file one at a time and pushed into a buffer until whitespace is seen, etc. I thought about writing my own wrapper for `fgetc()` that would instead return a `char[]` of the bytes that make up the UTF-8 character, and working with the result as a string... it'd be easy enough, but would become a slippery slope. I'd rather not waste time re-inventing the wheel and instead use an existing, tested library like ICU. So now I have code without UTF-8 support that works with `fgetc()`, `isspace()`, `strcmp()`, etc., which I am trying to update to use ICU. This is my first foray into ICU; I have been reading through the documentation and trying to find usage examples with Google code search, but there are still some points of confusion I'm hoping someone will be able to clarify.

The `u_fgetc()` function returns `UChar`, and `u_fgetcx()` returns `UChar32`... the documentation recommends using `u_fgetcx()` to read codepoints, so that's what I'm starting with. I'm keeping the same approach as above, but I'm pushing `UChar32`s into a buffer instead of `char`s.

What is the proper way to compare a character against a known value? Originally I was able to do `if (c == '+')` to check if the plus sign was fetched from the input. GCC doesn't complain when `c` is a `UChar32` (which is then a comparison between `UChar32` and `char`), but is this really proper?

I was able to use `strcmp()` to compare the buffered characters with a known value, for example `if (strcmp(buf, "else") == 0)`. There is `u_strcmp()` provided by ICU, and I think I may need to use the `U_STRING_DECL` and `U_STRING_INIT` macros to specify the known literal, but I am not certain. The documentation shows they result in `UChar[]`, though I assume I need `UChar32[]`... and I'm uncertain how to use them correctly anyway. Any guidance here would be welcomed.

After reading in a series of numeric characters I have been converting them with `strtol()` so I can work with them. Is there a similar function made available by ICU, since I am converting `UChar32[]` now?
Comments (2)
`UChar` is for holding a code unit, while `UChar32` is for holding a code point. If your input stays on the Basic Multilingual Plane (BMP), `UChar` is sufficient, and indeed most ICU functions operate on `UChar[]`.

The ICU User Guide is strongly recommended reading; it explains most of the internals and best practices.
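To illustrate the difference between code units and code points, here is a minimal sketch in plain C that does not use ICU at all. The typedefs `CodeUnit` and `CodePoint` and the function `encode_utf16` are hypothetical names chosen for this example; they only stand in for ICU's `UChar` and `UChar32`, on the assumption that the former is a 16-bit code unit and the latter a 32-bit signed integer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for ICU's types (assumption: UChar is a 16-bit
   code unit, UChar32 is a 32-bit signed code point). */
typedef uint16_t CodeUnit;
typedef int32_t  CodePoint;

/* Encode one code point as UTF-16. Writes 1 or 2 code units to `out`
   and returns how many were written. Returns 0 for values that are not
   encodable: surrogates, negatives, or anything above 0x10FFFF. */
size_t encode_utf16(CodePoint cp, CodeUnit out[2])
{
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        /* Inside the BMP: a single code unit holds the code point. */
        out[0] = (CodeUnit)cp;
        return 1;
    }

    /* Outside the BMP: split the remaining 20 bits into two halves. */
    cp -= 0x10000;
    out[0] = (CodeUnit)(0xD800 + (cp >> 10));    /* lead surrogate  */
    out[1] = (CodeUnit)(0xDC00 + (cp & 0x3FF));  /* trail surrogate */
    return 2;
}
```

A plus sign (U+002B) fits in one code unit, while a code point such as U+1F600 needs two, which is why a buffer of 16-bit units is only "one element per character" as long as the input stays on the BMP.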
What is the proper way to compare a Unicode character variable against a known value?
A character (or `UChar` or `UChar32`) is just another integer type with a certain width and signedness, and can be compared to other integer types with the usual caveats and restrictions. As for defining a character value, C99 (section 6.4.3) provides universal character names: `\u` followed by four hex digits, or `\U` followed by eight hex digits, specifying the ISO/IEC 10646 "short identifier". The range below 0x00A0 is reserved, with the exceptions of 0x0024 (`$`), 0x0040 (`@`), and 0x0060 (backtick), but characters in that range can be represented by casting a simple character constant to `UChar`. Also reserved is the range from 0xD800 through 0xDFFF (for use by UTF-16).

How to define Unicode string literals?

`U_STRING_DECL` and `U_STRING_INIT` are indeed what you're looking for. (As written above, ICU mainly operates on `UChar[]`.) If you were using C++ instead of C, `UNICODE_STRING_SIMPLE` (optionally followed by `getTerminatedBuffer()` to yield a `UChar[]` again) provides a much more comfortable way of defining Unicode string literals.

How to convert a Unicode string representing a number into that number's value?

`unum_parse()` and its siblings in `unum.h` will help you there.
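The "usual caveats" of comparing integer types mentioned in the comparison section can be made concrete with a small sketch in plain C. It does not call ICU; `CodePoint` is a hypothetical stand-in for `UChar32`, and the function names are made up for illustration. It assumes an ASCII-based execution character set and a two's-complement machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ICU's UChar32 (a 32-bit signed integer). */
typedef int32_t CodePoint;

/* A character constant such as '+' has type int in C, so on an
   ASCII-based system this compares the code point against 0x2B. */
bool is_plus(CodePoint c)
{
    return c == '+';
}

/* Pitfall: a byte >= 0x80 held in a signed char is negative. It is
   sign-extended before the comparison, so it never equals the
   (non-negative) code point with the same bit pattern. */
bool naive_equals(CodePoint c, signed char byte)
{
    return c == byte;
}

/* Converting through unsigned char keeps the value in 0..255. */
bool safe_equals(CodePoint c, signed char byte)
{
    return c == (unsigned char)byte;
}
```

With the Latin-1 byte 0xE9 stored as the signed value -23, `naive_equals(0xE9, -23)` is false while `safe_equals(0xE9, -23)` is true. Comparisons against plain ASCII constants like `'+'` are not affected, because those values are below 0x80.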
The Unicode value for PLUS SIGN is U+002B, and the normal (Latin-1) value for '+' is also 0x2B (053, 43). What you wrote is safe enough where the code set is based on ASCII or ISO-8859-x. The C99 standard provides for Unicode (universal character names) of the forms `\u0123` and `\U00102345` (with 4 and 8 hexadecimal digits), but stipulates that you cannot specify values less than `\u00A0`, such as `\u002B`. So, I think what you wrote is correct.

However, you could save yourself future angst by using an `enum` constant for the plus sign, defined in an appropriate header and used wherever you need a literal plus sign. That way, if your assumption (and my assumption) is wrong, you have one place to edit: the header.

I note that the page on Strings with ICU suggests that using UTF-32 in an application is unusual.

In pure C, you'd probably use `wcscmp(buf, L"else")`, assuming that the `wchar_t` on your system is equivalent to `uint32_t` and/or `UChar32`. There seem to be ways to use `UnicodeString` and `UNICODE_STRING("...")` followed by `toUTF32()` to create a UTF-32 string. There may also be neater ways.

There are 'Formatting' classes which handle both formatting and parsing. You would probably use classes derived from the `NumberFormat` class.
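The enum suggestion above might look something like the following sketch. The constant names (`CH_PLUS` and so on), the `CodePoint` typedef standing in for `UChar32`, and the helper `is_sign` are all hypothetical and only illustrate the idea of keeping character values in one header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ICU's UChar32 (a 32-bit signed integer). */
typedef int32_t CodePoint;

/* In a real project these would live in a shared header, so there is
   a single place to edit if an assumption turns out to be wrong. */
enum {
    CH_PLUS  = 0x002B,  /* U+002B PLUS SIGN    */
    CH_MINUS = 0x002D,  /* U+002D HYPHEN-MINUS */
    CH_EURO  = 0x20AC   /* U+20AC EURO SIGN    */
};

/* Scanner code then refers to the named constants only. */
bool is_sign(CodePoint c)
{
    return c == CH_PLUS || c == CH_MINUS;
}
```

Spelling the values as numeric code points rather than as `'+'` is one possible design choice: it keeps the constants independent of the compiler's execution character set, at the cost of being slightly less readable.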