Extending "isalnum" to recognize UTF-8 umlauts

Posted on 2024-12-09 03:27:37 · word count 615 · 5 views · 0 comments

I wrote a function that extends isalnum to recognize UTF-8 encoded umlauts.

Is there a more elegant way to solve this?

The code is as follows:

bool isalnumlaut(const char character) {
    int cr = (int) (unsigned char) character;
    if (isalnum(character)
            || cr == 195 // UTF-8
            || cr == 132 // Ä
            || cr == 164 // ä
            || cr == 150 // Ö
            || cr == 182 // ö
            || cr == 156 // Ü
            || cr == 188 // ü
            || cr == 159 // ß
    ) {
        return true;
    } else {
        return false;
    }
}

EDIT:

I have now tested my solution several times, and it seems to do the job for my purposes. Are there any strong objections?

Comments (3)

红焚 2024-12-16 03:27:37

Your code doesn't do what you're claiming.

The UTF-8 representation of Ä is two bytes: 0xC3, 0x84. A lone byte with a value above 0x7F is meaningless in UTF-8.
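
To make the two-byte point concrete, here is a minimal sketch that dumps the bytes of a string as hex. The helper name hex_bytes is made up for this illustration, and the input is written with explicit byte escapes so the result does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper for illustration: format each byte of a string as
// two uppercase hex digits, separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char byte : s) {  // view every char as an unsigned byte
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(byte));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_bytes("\xC3\x84") on the UTF-8 encoding of Ä yields the two bytes C3 and 84, which are the decimal values 195 and 132 tested one at a time in the question. The trailing byte 0x84 is not unique to Ä: Ą (U+0104) encodes as C4 84, so a byte-by-byte test cannot tell which character a byte belongs to.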


Some general suggestions:

  • Unicode is large. Consider using a library that has already handled the issues you're seeing, such as ICU.

  • It rarely makes sense for a function to operate on a single code unit or code point. It makes much more sense to have functions that operate on either ranges of code points or single glyphs (see here for definitions of those terms).

  • Your concept of alpha-numeric is likely to be underspecified for a character set as large as the Universal Character Set; do you want to treat the characters in the Cyrillic alphabet as alphanumerics? Unicode's concept of what is alphabetic may not match yours - especially if you haven't considered it.
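
One way to pin down the last point, if a library such as ICU is not an option, is to spell out the accepted set explicitly at the code point level. The sketch below is only an assumption about what the asker wants (ASCII letters and digits plus the German umlauts and ß); the name is_alnum_de is invented for this example, and it expects an already decoded code point, not a UTF-8 byte.

```cpp
#include <cassert>

// Hypothetical whitelist over Unicode code points (not UTF-8 bytes):
// ASCII letters and digits plus the German umlauts and sharp s.
bool is_alnum_de(char32_t cp) {
    if ((cp >= U'0' && cp <= U'9') ||
        (cp >= U'A' && cp <= U'Z') ||
        (cp >= U'a' && cp <= U'z')) {
        return true;
    }
    switch (cp) {
        case 0x00C4:  // Ä
        case 0x00D6:  // Ö
        case 0x00DC:  // Ü
        case 0x00E4:  // ä
        case 0x00F6:  // ö
        case 0x00FC:  // ü
        case 0x00DF:  // ß
            return true;
        default:
            return false;  // everything else, including Cyrillic letters
    }
}
```

With this definition the Cyrillic question is answered by construction: U+0416 (Ж) is rejected because it is not on the list. Whether that is the right answer depends on the application, which is exactly why the set should be written down rather than inferred from byte values.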

月光色 2024-12-16 03:27:37

I'm not 100% sure, but the C++ std::isalnum in <locale> almost certainly recognizes additional locale-specific characters: http://www.cplusplus.com/reference/std/locale/isalnum/
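
For reference, a rough sketch of how the <locale> overload can be called. The wrapper name isalnum_in_locale and the locale name passed to it are assumptions: locale names are platform-specific, and constructing std::locale from a name that is not installed throws std::runtime_error, so the sketch falls back to the classic "C" locale in that case. Note that this works on a whole wchar_t, so a UTF-8 byte sequence would still have to be decoded first.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>

// Hypothetical wrapper: classify a wide character with the named locale,
// falling back to the classic "C" locale if that name is unavailable here.
bool isalnum_in_locale(wchar_t ch, const char* locale_name) {
    try {
        return std::isalnum(ch, std::locale(locale_name));
    } catch (const std::runtime_error&) {
        return std::isalnum(ch, std::locale::classic());
    }
}
```

A call such as isalnum_in_locale(L'\u00E4', "de_DE.UTF-8") may report ä as alphanumeric where that locale is installed, but the result for non-ASCII characters depends on the platform and its locale data, so it should be checked on the target system.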

刘备忘录 2024-12-16 03:27:37

It's impossible with the interface you define, since UTF-8 is a
multibyte encoding; a single character may require multiple chars to
represent it. (I've got code in my library for determining whether a
UTF-8 character is a member of a specified set of characters, but the
character is specified by a pair of iterators, not a single char.)
