使用 ICU 去除变音符号的代码

发布于 2024-09-04 09:32:24 字数 343 浏览 14 评论 0原文

有人可以提供一些示例代码来去除变音符号（即，将具有重音、变音符号等的字符替换为不重音、不变音等的字符等价物，例如，每个重音 é 都会变成使用 C++ 中的 ICU 库从 UnicodeString 获取纯 ASCII e）？例如：

UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}

假设s已经被标准化。谢谢。

原文

Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., character equivalents, e.g., every accented é would become a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.:

UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}

Assume that s has already been normalized. Thanks.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

断念 2024-09-11 09:32:24

ICU 允许您使用特定规则音译字符串。我的规则是 NFD； [:M:] 删除； NFC：分解、删除变音符号、重新组合。以下代码采用 UTF-8 std::string 作为输入并返回另一个 UTF-8 std::string：

#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string desaxUTF8(const std::string& str) {
    // UTF-8 std::string -> UTF-16 UnicodeString
    UnicodeString source = UnicodeString::fromUTF8(StringPiece(str));

    // Transliterate UTF-16 UnicodeString
    UErrorCode status = U_ZERO_ERROR;
    Transliterator *accentsConverter = Transliterator::createInstance(
        "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status);
    accentsConverter->transliterate(source);
    // TODO: handle errors with status

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);

    return result;
}

ICU lets you transliterate a string using a specific rule. My rule is NFD; [:M:] Remove; NFC: decompose, remove diacritics, recompose. The following code takes an UTF-8 std::string as an input and returns another UTF-8 std::string:

#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string desaxUTF8(const std::string& str) {
    // UTF-8 std::string -> UTF-16 UnicodeString
    UnicodeString source = UnicodeString::fromUTF8(StringPiece(str));

    // Transliterate UTF-16 UnicodeString
    UErrorCode status = U_ZERO_ERROR;
    Transliterator *accentsConverter = Transliterator::createInstance(
        "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status);
    accentsConverter->transliterate(source);
    // TODO: handle errors with status

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);

    return result;
}

回复收藏 0 原文

独自唱情﹋歌 2024-09-11 09:32:24

在其他地方进行更多搜索后：

UErrorCode status = U_ZERO_ERROR;
UnicodeString result;

// 's16' is the UTF-16 string to have diacritics removed
Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
if ( U_FAILURE( status ) )
  // complain

// code to convert UTF-16 's16' to UTF-8 std::string 's8' elided

string buf8;
buf8.reserve( s8.length() );
for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
  char const c = *i;
  if ( isascii( c ) )
    buf8.push_back( c );
}
// result is in buf8

这是 O(n) 。

After more searching elsewhere:

UErrorCode status = U_ZERO_ERROR;
UnicodeString result;

// 's16' is the UTF-16 string to have diacritics removed
Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
if ( U_FAILURE( status ) )
  // complain

// code to convert UTF-16 's16' to UTF-8 std::string 's8' elided

string buf8;
buf8.reserve( s8.length() );
for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
  char const c = *i;
  if ( isascii( c ) )
    buf8.push_back( c );
}
// result is in buf8

which is O(n).

回复收藏 0 原文

~没有更多了~