isLetter with accented characters in C
I'd like to create (or find) a C function that checks whether a char c is a letter.
I can do this for a-z and A-Z easily, of course.
However, I get an error when testing c == á, ã, ô, ç, ë, etc.
Those special characters are probably stored in more than one char.
I'd like to know:
How are these special characters stored, which arguments does my function need to receive, and how should it do the check?
I'd also like to know whether there is a standard function that already does this.
Comments (4)
I think you're looking for the iswalpha() routine. It depends on the LC_CTYPE category of the current locale (see locale(7)), so it might not be ideal in a program that has to handle several kinds of input correctly at the same time.
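A minimal sketch of how iswalpha() might be wrapped, assuming the input has already been converted to wchar_t. The helper names use_environment_locale and is_letter are hypothetical, chosen only for illustration; setlocale() and iswalpha() are the standard calls.

```c
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: adopt the locale named by the environment
   (LANG, LC_ALL, ...). Returns 1 on success, 0 if that locale is
   not available on this system. */
int use_environment_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: 1 if wc is alphabetic according to the
   LC_CTYPE category of the current locale, 0 otherwise. */
int is_letter(wchar_t wc)
{
    return iswalpha((wint_t)wc) != 0;
}
```

A program would typically call use_environment_locale() once at startup and then pass wide characters such as L'\u00e1' to is_letter(). What it reports for non-ASCII input depends on the active locale, which is exactly the caveat raised above.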
If you are working with a single-byte codeset such as ISO 8859-1 or 8859-15 (or any of the other 8859-x codesets), the isalpha() function will do the job, provided you remember to call setlocale(LC_ALL, ""); (or some other suitable invocation of setlocale()) in your program. Without this, the program runs in the C locale, which only classifies the ASCII characters (the 8859-x characters in the range 0x00..0x7F).
If you are working with multibyte or wide character codesets (such as UTF-8 or UTF-16), you need to look to the wide character functions found in <wchar.h> and <wctype.h>.
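The two cases above could be sketched roughly as follows. The function names is_letter_byte and count_letters are hypothetical; the sketch assumes the caller has already selected a suitable locale with setlocale(), and that multibyte input is encoded in that locale's codeset.

```c
#include <ctype.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Single-byte codesets: isalpha() expects a value representable as
   unsigned char (or EOF), so cast before calling it. */
int is_letter_byte(char c)
{
    return isalpha((unsigned char)c) != 0;
}

/* Multibyte codesets: decode one character at a time with mbrtowc()
   and classify it with iswalpha(). Returns the number of letters
   found, or -1 if the string holds an invalid or truncated sequence
   for the current locale. */
long count_letters(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    long letters = 0;

    memset(&state, 0, sizeof state);   /* initial conversion state */
    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                 /* invalid or incomplete input */
        if (n == 0)
            break;                     /* decoded a terminating null */
        if (iswalpha((wint_t)wc))
            letters++;
        s += n;
        remaining -= n;
    }
    return letters;
}
```

Without a setlocale() call both functions run in the C locale, so only ASCII letters are guaranteed to be recognised, as described above.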
How these characters are stored is locale-dependent. On most UNIX systems they'll be stored as UTF-8, whereas a Win32 machine will likely represent them as UTF-16. UTF-8 stores each character as a variable number of chars. UTF-16 stores each character as one or two 16-bit units held in a wchar_t (or unsigned short). Incidentally, sizeof(wchar_t) on Windows is only 2 (versus typically 4 on *nix), so a character that needs a surrogate pair takes 2 wchar_t values there. That applies only to characters outside the Basic Multilingual Plane; the accented letters in the question each fit in a single unit.
As was mentioned, the iswalpha() routine will do this for you, and is documented here. It should take care of locale-specific issues for you.
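To make the storage sizes concrete, here is a small sketch that derives them from the encoding rules of UTF-8 and UTF-16. The function names utf8_sequence_length and utf16_units are hypothetical and do not come from any library.

```c
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` is a continuation byte or not a valid lead byte. */
size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                      /* 0xxxxxxx: plain ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;                      /* 110xxxxx: e.g. accented Latin */
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;                      /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;                      /* 11110xxx */
    return 0;
}

/* Number of 16-bit units UTF-16 needs for one code point.
   Assumes code_point is a valid Unicode scalar value. */
size_t utf16_units(unsigned long code_point)
{
    return code_point < 0x10000UL ? 1 : 2;   /* 2 means a surrogate pair */
}
```

For example, á is U+00E1: in UTF-8 it is the two bytes 0xC3 0xA1, so it does not fit in a single char, while in UTF-16 it fits in one 16-bit unit.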
You probably want http://site.icu-project.org/. It provides a portable library with APIs for this.