Cross-platform iteration of Unicode strings (counting graphemes using ICU)
I want to iterate over each character of a Unicode string, treating each surrogate pair and each combining character sequence as a single unit (one grapheme).
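To make the surrogate-pair half of that requirement concrete, here is a minimal standard-library sketch that counts code points in a UTF-16 string by folding each high/low surrogate pair into one unit. The helper name countCodePointsUtf16 is made up for illustration, and unpaired surrogates are simply counted as one unit each, which is an assumption rather than a rule from any library. It does nothing about combining sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any library): counts Unicode code points
// in a UTF-16 string by treating each high/low surrogate pair as one unit.
std::size_t countCodePointsUtf16(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t unit = s[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < s.size())
        {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF)
                ++i; // consume the low surrogate together with the high one
        }
        ++count; // assumption: an unpaired surrogate still counts as one unit
    }
    return count;
}
```

For the sample text this still reports 6, because every code point in it lies in the Basic Multilingual Plane; a character such as U+1F600, stored as two code units, is reported as 1.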
Example
The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D (the virama) and U+0947 (a vowel sign) are combining marks.
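To show why 6 code points collapse into 4 units, here is a toy sketch that attaches each mark to the character before it. The names isToyDevanagariMark and countToyClusters are hypothetical, and the hard-coded range U+093E..U+094F (Devanagari dependent vowel signs and the virama) is an assumption chosen only to cover this example. Real segmentation follows the grapheme cluster rules of UAX #29 and needs a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Toy predicate, an assumption for illustration only: treats the Devanagari
// dependent vowel signs and the virama (U+093E..U+094F) as marks that attach
// to the preceding character. Real segmentation follows UAX #29.
bool isToyDevanagariMark(char32_t cp)
{
    return cp >= 0x093E && cp <= 0x094F;
}

// Hypothetical helper: counts clusters in a UTF-32 string, where a cluster
// is one character plus any toy marks that immediately follow it.
std::size_t countToyClusters(const std::u32string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // A mark only starts a new cluster when nothing precedes it.
        if (i == 0 || !isToyDevanagariMark(s[i]))
            ++count;
    }
    return count;
}
```

For U+0928 U+092E U+0938 U+094D U+0924 U+0947 the marks U+094D and U+0947 are absorbed by their predecessors, so the function returns 4.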
static void Main(string[] args)
{
    const string s = "नमस्ते";
    Console.WriteLine(s.Length); // Outputs "6" (UTF-16 code units)
    var l = 0;
    var e = System.Globalization.StringInfo.GetTextElementEnumerator(s);
    while (e.MoveNext()) l++;
    Console.WriteLine(l); // Outputs "4" (text elements)
}
So there we have it in .NET. We also have Win32's CharNextW():
#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t * s = L"नमस्ते";
    std::cout << std::wstring(s).length() << std::endl; // Gives "6"
    int l = 0;
    // CharNextW returns its argument unchanged at the terminating null
    while (CharNextW(s) != s)
    {
        s = CharNextW(s);
        ++l;
    }
    std::cout << l << std::endl; // Gives "4"
    return 0;
}
Question
Both ways I know of are specific to Microsoft. Are there portable ways to do it?
- I heard about ICU, but I couldn't quickly find anything relevant (UnicodeString(s).length() still gives 6). An answer that points to the related function/module in ICU would be acceptable.
- C++ doesn't have a notion of Unicode, so a lightweight cross-platform library for dealing with these issues would make an acceptable answer.
Edit: Correct answer using ICU
@McDowell gave the hint to use BreakIterator from ICU, which I think can be regarded as the de facto cross-platform standard for dealing with Unicode. Here's some example code to demonstrate its use (since examples are surprisingly rare):
#include <unicode/schriter.h>
#include <unicode/brkiter.h>
#include <iostream>
#include <cassert>
#include <memory>

int main()
{
    // A char16_t literal is used instead of L"..." because wchar_t is not
    // UTF-16 on every platform (assumes ICU 59 or later).
    const icu::UnicodeString str(u"नमस्ते");
    {
        // StringCharacterIterator doesn't seem to recognize graphemes
        icu::StringCharacterIterator iter(str);
        int count = 0;
        while (iter.hasNext())
        {
            ++count;
            iter.next();
        }
        std::cout << count << std::endl; // Gives "6"
    }
    {
        // BreakIterator works!!
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> iter(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);
        int count = 0;
        // Each call to next() advances to the following grapheme boundary
        while (iter->next() != icu::BreakIterator::DONE) ++count;
        std::cout << count << std::endl; // Gives "4"
    }
    return 0;
}
Comments (3)
You should be able to use the ICU BreakIterator for this (the character instance, assuming it is feature-equivalent to the Java version).
Glib's ustring class gives you UTF-8 strings, if using UTF-8 is OK for you. It is designed to be similar to std::string. Since UTF-8 is native on Linux, your task is quite easy: you can iterate over the string's characters as usual with Glib::ustring::iterator.
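One caveat worth checking: as far as I know, Glib::ustring::iterator steps through code points (gunichar values), not graphemes, so for the sample text it would visit 6 items rather than 4. Here is a standard-library-only sketch of that kind of code point counting over UTF-8. The helper name countCodePointsUtf8 is hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Glib or any library): counts code points
// in a UTF-8 string by counting every byte that is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is well-formed UTF-8; no validation is done.
std::size_t countCodePointsUtf8(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        if ((byte & 0xC0) != 0x80)
            ++count; // lead byte or ASCII byte starts a new code point
    }
    return count;
}
```

The sample text occupies 18 bytes in UTF-8 (3 bytes per code point) and this function returns 6 for it, the same number that std::wstring::length() gave on Windows, so code point iteration alone does not group the combining marks.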
ICU has a very old interface; Boost.Locale is much better.