Extending isalnum to recognize UTF-8 umlauts
I wrote a function that extends isalnum to recognize UTF-8 encoded umlauts.
Is there a more elegant way to solve this?
The code is as follows:
bool isalnumlaut(const char character) {
    int cr = (int) (unsigned char) character;
    if (isalnum(character)
        || cr == 195 // UTF-8
        || cr == 132 // Ä
        || cr == 164 // ä
        || cr == 150 // Ö
        || cr == 182 // ö
        || cr == 156 // Ü
        || cr == 188 // ü
        || cr == 159 // ß
    ) {
        return true;
    } else {
        return false;
    }
}
EDIT:
I have now tested my solution several times, and it seems to do the job for my purposes. Any strong objections?
Comments (3)
Your code doesn't do what you're claiming.
The UTF-8 representation of Ä is two bytes: 0xC3, 0x84. A lone byte with a value above 0x7F is meaningless in UTF-8.
Some general suggestions:
Unicode is large. Consider using a library that has already handled the issues you're seeing, such as ICU.
It doesn't often make sense for a function to operate on a single code unit or code point. It makes much more sense to have functions that operate on either ranges of code points or single glyphs (see here for definitions of those terms).
Your concept of alpha-numeric is likely to be underspecified for a character set as large as the Universal Character Set; do you want to treat the characters in the Cyrillic alphabet as alphanumerics? Unicode's concept of what is alphabetic may not match yours - especially if you haven't considered it.
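To make the two-byte point concrete, here is a minimal sketch that looks at a position in a byte string rather than at a lone char. The helper name alnumlautLength is hypothetical, and the sketch assumes the input is UTF-8 and that only the seven characters from the question (Ä ä Ö ö Ü ü ß) are of interest; it is not a general Unicode classifier.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): returns how many bytes
// the ASCII-alphanumeric or German-umlaut character starting at s[i]
// occupies (1 or 2), or 0 if no such character starts there.
std::size_t alnumlautLength(const std::string& s, std::size_t i) {
    if (i >= s.size()) return 0;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) return std::isalnum(b0) ? 1 : 0;  // plain ASCII byte
    // All seven characters are encoded as lead byte 0xC3 plus one more byte.
    if (b0 != 0xC3 || i + 1 >= s.size()) return 0;
    switch (static_cast<unsigned char>(s[i + 1])) {
        case 0x84: case 0xA4:  // Ä, ä
        case 0x96: case 0xB6:  // Ö, ö
        case 0x9C: case 0xBC:  // Ü, ü
        case 0x9F:             // ß
            return 2;
        default:
            return 0;          // some other character in U+00C0..U+00FF
    }
}
```

A caller would advance its index by the returned length, so a lone 0x84 or a truncated 0xC3 is rejected instead of being counted as a letter.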
I'm not 100% sure, but the C++ std::isalnum in <locale> almost certainly recognizes locale-specific additional characters: http://www.cplusplus.com/reference/std/locale/isalnum/
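As a rough sketch of that suggestion: the <locale> overload of std::isalnum takes a character and a std::locale. The helper name isAlnumInLocale is hypothetical, and so is any locale name passed to it (for example "de_DE.UTF-8"), since the set of installed locales is system-specific. This classifies one wchar_t, so a UTF-8 byte sequence would have to be decoded first, and whether L'ä' counts as alphanumeric depends on the locale actually in use.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: classify one wide character using the named locale,
// falling back to the classic "C" locale if that name is not installed.
bool isAlnumInLocale(wchar_t ch, const std::string& localeName) {
    std::locale loc = std::locale::classic();
    try {
        loc = std::locale(localeName);  // throws std::runtime_error if unknown
    } catch (const std::runtime_error&) {
        // Locale not available on this system; keep the classic locale.
    }
    return std::isalnum(ch, loc);
}
```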
It's impossible with the interface you define, since UTF-8 is a multibyte encoding; a single character can require multiple char values to represent it. (I've got code for determining whether a UTF-8 character is a member of a specified set of characters in my library, but the character is specified by a pair of iterators, and not a single char.)