Cross-platform iteration of Unicode strings (counting graphemes using ICU)

Posted 2024-10-10 04:38:14

I want to iterate over each character of a Unicode string, treating each surrogate pair and each combining character sequence as a single unit (one grapheme).

Example

The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D and U+0947 are combining marks.

static void Main(string[] args)
{
    const string s = "नमस्ते";

    Console.WriteLine(s.Length); // Outputs "6"

    var l = 0;
    var e = System.Globalization.StringInfo.GetTextElementEnumerator(s);
    while(e.MoveNext()) l++;
    Console.WriteLine(l); // Outputs "4"
}

So there we have it in .NET. We also have Win32's CharNextW():

#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t * s = L"नमस्ते";

    std::cout << std::wstring(s).length() << std::endl; // Gives "6"

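    int l = 0;
    // CharNextW returns its argument unchanged once s points at the terminating null.
    for (const wchar_t * next = CharNextW(s); next != s; next = CharNextW(s))
    {
        s = next;
        ++l;
    }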

    std::cout << l << std::endl; // Gives "4"

    return 0;
}

Question

Both ways I know of are specific to Microsoft. Are there portable ways to do it?

I have heard about ICU, but I couldn't quickly find anything relevant (UnicodeString(s).length() still gives 6). An answer that points to the relevant function/module in ICU would be acceptable.
C++ doesn't have a notion of Unicode, so a lightweight cross-platform library for dealing with these issues would also make an acceptable answer.

Edit: Correct answer using ICU

@McDowell gave the hint to use BreakIterator from ICU, which I think can be regarded as the de facto cross-platform standard for dealing with Unicode. Here is some example code demonstrating its use (since examples are surprisingly rare):

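#include <unicode/unistr.h>
#include <unicode/schriter.h>
#include <unicode/brkiter.h>

#include <iostream>
#include <cassert>
#include <memory>

using namespace icu; // recent ICU releases do not add this for you

int main()
{
    // A char16_t literal is portable; wchar_t is 32 bits wide on most non-Windows platforms.
    const UnicodeString str(u"नमस्ते");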

    {
        // StringCharacterIterator doesn't seem to recognize graphemes
        StringCharacterIterator iter(str);
        int count = 0;
        while(iter.hasNext())
        {
            ++count;
            iter.next();
        }
        std::cout << count << std::endl; // Gives "6"
    }

    {
        // BreakIterator works!!
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<BreakIterator> iter(
            BreakIterator::createCharacterInstance(Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);

        int count = 0;
        while(iter->next() != BreakIterator::DONE) ++count;
        std::cout << count << std::endl; // Gives "4"
    }

    return 0;
}


毁虫ゝ 2024-10-17 04:38:14

You should be able to use the ICU BreakIterator (the character instance, assuming it is feature-equivalent to the Java version).


孤单情人 2024-10-17 04:38:14

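Glib's ustring class gives you UTF-8 strings, if using UTF-8 is OK for you. It is designed to be similar to std::string. Since UTF-8 is native on Linux, your task is quite easy:

#include <glibmm/ustring.h>
#include <iostream>

int main()
{
    // Narrow literal, assumed to be UTF-8 encoded; ustring cannot be built from a wchar_t literal.
    Glib::ustring s = "नमस्ते";
    std::cout << s.size(); // number of code points (6), not graphemes
}

You can also iterate over the string's characters as usual with Glib::ustring::iterator. Note that each element is a code point (gunichar), not a grapheme.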


天赋异禀 2024-10-17 04:38:14

ICU has a very old interface; Boost.Locale is much better:

#include <iostream>
#include <string_view>

#include <boost/locale.hpp>

using namespace std::string_view_literals;

int main()
{
    boost::locale::generator gen;
    auto string = "noël ????????"sv;
    boost::locale::boundary::csegment_index map{
        boost::locale::boundary::character, std::begin(string),
        std::end(string), gen("")};
    for (const auto& i : map)
    {
        std::cout << i << '\n';
    }
}

Text is from here


Author: 天邊彩虹
