C++ encoding a string to Unicode - ICU library

Posted 2024-09-19 17:53:51

I need to convert a bunch of bytes in ISO-2022-JP and ISO-2022-JP-2 (and other variations of ISO-2022) into Unicode. I am trying to use ICU, but the following code doesn't work.

std::string input = "\x1B\x28\x4A" "ABC\xA6\xA7";    //the first 3 chars are escape sequence to use JIS_X201 character set in GL/GR
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
// set up the converter
conv = ucnv_open("ISO-2022-JP", &status);
if (status != U_ZERO_ERROR) return false;   //couldn't find character set

UChar * convDest = new UChar[2*input.length()]; //ucnv_toUChars will use up to 2*length

// convert to Unicode
int resultLen = (int)ucnv_toUChars(conv, convDest, 2*input.length(), input.c_str(), input.length(), &status);

This doesn't work. The result contains '?' characters for anything I put in that was above ASCII. The status reports no error. What am I doing wrong?

On top of that, I was having trouble compiling version 4.4 of the library, as the MSVC 9 project would not convert to an MSVC 10 project.

I am also aware of the open-source libiconv library, but I couldn't compile it on Windows. If anyone has advice on a different library, that's also welcome.

Thanks.

EDIT
The escape sequence I originally used was wrong. ICU now accepts the string and strips out the escape sequence, which is a step in the right direction, but the result still contains '?' characters.

EDIT2 The reason I couldn't convert to an MSVC 10 project was that the x64 platform wasn't installed (it isn't by default). Alternatively, I could open all the projects in a text editor and remove every mention of the x64 target.

往日 2024-09-26 17:53:51

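This doesn't resemble ISO 2022 encoding. The high bit should be zero. The escape sequence looks somewhat recognizable, but it should start with ESC, 0x1b, not 0xb0. I have no idea what these byte values are really supposed to mean.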

千鲤 2024-09-26 17:53:51

(This question looks familiar; hi again.)

One small nit: you want to check the error status with if(U_FAILURE(status)) (or, conversely, U_SUCCESS(status)).

雨落□心尘 2024-09-26 17:53:51

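I couldn't get the conversion of the JIS_X201 character set in ISO-2022-JP encoding to work. I couldn't produce a "working" result with any of the tools available to me: I tried Java (both the ICU and non-ICU implementations of ISO2022) and C++.

So I basically just wrote a function that does the code lookup and converts to Unicode using this table: wikipedia.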
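For anyone who wants to do the same, here is a minimal sketch of what such a lookup could look like. It is not the function I actually wrote, and the name decodeJisX0201 is made up for illustration. It assumes the escape sequence has already been stripped and that every remaining byte is 8-bit JIS X 0201: 0x5C is the yen sign, 0x7E is the overline, the other bytes below 0x80 match ASCII, and 0xA1 to 0xDF map to the halfwidth katakana block U+FF61 to U+FF9F. Anything else becomes U+FFFD.

```cpp
#include <string>

// Hypothetical helper (not part of ICU): decodes bytes of the 8-bit
// JIS X 0201 character set into UTF-16 code units.
// Assumption: escape sequences have already been removed from the input.
std::u16string decodeJisX0201(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        if (b == 0x5C) {
            out.push_back(u'\u00A5');   // YEN SIGN takes the place of backslash
        } else if (b == 0x7E) {
            out.push_back(u'\u203E');   // OVERLINE takes the place of tilde
        } else if (b < 0x80) {
            out.push_back(static_cast<char16_t>(b));   // same as ASCII
        } else if (b >= 0xA1 && b <= 0xDF) {
            // Halfwidth katakana: 0xA1..0xDF -> U+FF61..U+FF9F
            out.push_back(static_cast<char16_t>(0xFF61 + (b - 0xA1)));
        } else {
            out.push_back(u'\uFFFD');   // not defined in JIS X 0201
        }
    }
    return out;
}
```

With the payload from the question, decodeJisX0201("ABC\xA6\xA7") yields "ABC" followed by U+FF66 and U+FF67.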

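EDIT
When I started filling out a bug report, I wanted to include the RFC for ISO-2022-JP. Then I found this line in the RFC: "the kana set of JIS X 0201 is not used in ISO-2022-JP messages". So it appears that the standard doesn't actually define the high bit. ISO-2022-JP-3 will map the high bits, but to the lower plane. So I would have to take each high byte, subtract 0x80 from it, and pass it through ISO-2022-JP-3, then take the remaining bytes < 128 and pass them to the ISO-2022-JP converter to get the full JIS_X201 character set. Doing it myself is much easier.

So strictly speaking, I would say this is not a bug. But it certainly is a major headache.

P.S. The whole mess of a stream I am trying to decode comes from DICOM. See page 107 of the pdf for what they consider acceptable.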