Convert a C++ UTF-8 literal to vector<wchar_t> and then to a uint32_t array[]

Posted on 2025-01-12 02:42:19

I must admit that I have very little experience with C++, only a few hundred lines.

I solved the problem, but I'm sure there is a better solution. I am posting this because I found no solution via Google or Stack Overflow, and other users may have a similar problem.

This is the portion of the C99 code to be ported to C++11:

// src/levtest.c

char utf_str2[] = "Chſerſplzon";
uint32_t utf_len2 = strlen(utf_str2);

// convert to ucs
uint32_t b_ucs[(utf_len2+1)*4]; // plenty of space
int b_chars;
b_chars = u8_toucs(b_ucs, (utf_len2+1)*4, utf_str2, utf_len2);

int distance;

distance = dist_uni(a_ucs, a_chars, b_ucs, b_chars);
printf("[dist_uni]      distance: %d expect: 4\n", distance);

The background is that the code in src/levbv.c should work from Perl XS, C, and C++ (and perhaps other language bindings such as Python). It is highly optimised and should use C types. vector<wchar_t> is needed because one C++ distribution (the training part of Tesseract-OCR) uses vector<wchar_t> for the relevant portions.

Here is my corresponding code in C++:

// src/levbvcpp.cpp
// needs <codecvt>, <locale>, <string>, <vector>, <cstdint>, <cstring>, <cstdio>

char utf_str2[] = u8"Chſerſplzon";
    
uint32_t utf_len2 = strlen(utf_str2);
    
// convert u8 to wstring
std::wstring b_string = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(utf_str2);

// convert wstring to vector<wchar_t>
std::vector<wchar_t> b_uv(b_string.begin(), b_string.end());
int b_chars = b_uv.size();

uint32_t b_ucs[(utf_len2+1)*4];
    

// convert vector<wchar_t> to uint32_t array[]
unsigned int index = 0;
for (uint32_t b_char : b_uv) {
    b_ucs[index] = b_char;
    index++;
}

    
int distance;

distance = dist_uni(a_ucs, a_chars, b_ucs, b_chars);
printf("[dist_uni]      distance: %d expect: 4\n", distance);
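One possible way to shorten the last conversion step is sketched here. It copies the vector<wchar_t> into a std::vector<uint32_t> instead of a stack array sized at run time, which is a compiler extension rather than standard C++11. The helper name to_ucs is made up for this sketch, and it assumes that every wchar_t holds one full code point (32-bit wchar_t, as with clang on macOS or gcc on Linux).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the project): copy the code points held
// in a vector<wchar_t> into a contiguous buffer of uint32_t.
// Assumption: wchar_t is 32 bits wide, so each element is one code point.
// With a 16-bit wchar_t (for example on Windows) the elements would be
// UTF-16 code units, and this would not produce UCS-4 values.
std::vector<std::uint32_t> to_ucs(const std::vector<wchar_t>& in) {
    std::vector<std::uint32_t> out;
    out.reserve(in.size());
    for (wchar_t c : in) {
        out.push_back(static_cast<std::uint32_t>(c));
    }
    return out;
}
```

If dist_uni takes a pointer to uint32_t plus a length, as the calls above suggest, the result could then be passed as b.data() and static_cast<int>(b.size()). The exact signature is not shown in the question, so that part is an assumption.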

The source is on https://github.com/wollmers/Text-Levenshtein-BVXS.

Questions:

Is there a better way to convert?
What data type and encoding do L"こんにちは世界" and u"こんにちは世界" have? The reference manual is not very technical on this point.
The code is compiled with -std=c++11 -Wall -g -finput-charset=utf-8 -O3 using clang on macOS. Is there anything to consider on other platforms/compilers when the source is encoded in UTF-8? I did not find a clear answer on Stack Overflow.
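For the second and third questions, a small sketch may help. The static_asserts record the array element types that C++11 gives to L"...", u"..." and U"..." literals. As far as I know, u"..." is UTF-16 and U"..." is UTF-32, while the width and encoding of wchar_t are implementation-defined (typically 32 bits on macOS and Linux, 16 bits on Windows). The helpers code_points and sample_code_points are invented for this sketch. The sample string uses the universal character name \u017F for ſ, so the result does not depend on how the compiler reads the source file.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Element types of prefixed string literals in C++11. decltype of a string
// literal is a reference to an array, hence the remove_reference.
static_assert(std::is_same<std::remove_reference<decltype(L"a")>::type,
                           const wchar_t[2]>::value,
              "L literal is an array of const wchar_t");
static_assert(std::is_same<std::remove_reference<decltype(u"a")>::type,
                           const char16_t[2]>::value,
              "u literal is an array of const char16_t");
static_assert(std::is_same<std::remove_reference<decltype(U"a")>::type,
                           const char32_t[2]>::value,
              "U literal is an array of const char32_t");

// Hypothetical helper: copy the code points of a U"..." literal into a
// vector<uint32_t>, leaving out the terminating null character.
template <std::size_t N>
std::vector<std::uint32_t> code_points(const char32_t (&lit)[N]) {
    std::vector<std::uint32_t> out;
    out.reserve(N - 1);
    for (std::size_t i = 0; i + 1 < N; ++i) {
        out.push_back(static_cast<std::uint32_t>(lit[i]));
    }
    return out;
}

// The test string from the question; \u017F is LATIN SMALL LETTER LONG S.
inline std::vector<std::uint32_t> sample_code_points() {
    return code_points(U"Ch\u017Fer\u017Fplzon");
}
```

When the string is known at compile time, a U"..." literal already contains one code point per element, so no run-time UTF-8 decoding is needed. For strings that arrive as UTF-8 at run time, a decoder is still required.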
