Convert a UTF-8 string literal to vector<wchar_t> and then to a uint32_t[] array in C++
I must admit that I have very little experience with C++, only a few hundred lines.
I solved the problem, but I'm sure there is a better solution. I am writing this because I found no solution via Google or Stack Overflow, and other users may have a similar problem.
This is the portion of C99 code to be ported to C++11:
    // src/levtest.c
    char utf_str2[] = "Chſerſplzon";
    uint32_t utf_len2 = strlen(utf_str2);
    // convert to ucs
    uint32_t b_ucs[(utf_len2+1)*4]; // plenty of space
    int b_chars;
    b_chars = u8_toucs(b_ucs, (utf_len2+1)*4, utf_str2, utf_len2);
    int distance;
    distance = dist_uni(a_ucs, a_chars, b_ucs, b_chars);
    printf("[dist_uni] distance: %u expect: 4\n", distance);
The background is that the code in src/levbv.c should work via Perl XS, C, and C++ (and maybe other language bindings such as Python). It is highly optimised and should use C types. vector<wchar_t> is needed because one C++ distribution (the training tools of Tesseract-OCR) uses vector<wchar_t> for the relevant portions.
Here is my corresponding code in C++:
    // src/levbvcpp.cpp
    char utf_str2[] = u8"Chſerſplzon";
    uint32_t utf_len2 = strlen(utf_str2);
    // convert u8 to wstring
    std::wstring b_string = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(utf_str2);
    // convert wstring to vector<wchar_t>
    std::vector<wchar_t> b_uv(b_string.begin(), b_string.end());
    int b_chars = b_uv.size();
    uint32_t b_ucs[(utf_len2+1)*4];
    // convert vector<wchar_t> to uint32_t array[]
    unsigned int index = 0;
    for (uint32_t b_char : b_uv) {
        b_ucs[index] = b_char;
        index++;
    }
    int distance;
    distance = dist_uni(a_ucs, a_chars, b_ucs, b_chars);
    printf("[dist_uni] distance: %u expect: 4\n", distance);
The source is on https://github.com/wollmers/Text-Levenshtein-BVXS.
Questions:
- Is there a better way to convert?
- What datatype and encoding would L"こんにちは世界" or u"こんにちは世界" have? The reference manual is somewhat untechnical.
- The code is compiled with -std=c++11 -Wall -g -finput-charset=utf-8 -O3 and clang on macOS. Is there anything to consider on other platforms or compilers when the source is encoded in UTF-8? I did not find a clear answer on Stack Overflow.