C++: Encoding a string to Unicode with the ICU library
I need to convert a bunch of bytes in ISO-2022-JP and ISO-2022-JP-2 (and other variations of ISO-2022) into Unicode. I am trying to use ICU, but the following code doesn't work.
std::string input = "\x1B\x28\x4A" "ABC\xA6\xA7"; //the first 3 chars are escape sequence to use JIS_X201 character set in GL/GR
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
// set up the converter
conv = ucnv_open("ISO-2022-JP", &status);
if (status != U_ZERO_ERROR) return false; //couldn't find character set
UChar * convDest = new UChar[2*input.length()]; //ucnv_toUChars will use up to 2*length
// convert to Unicode
int resultLen = (int)ucnv_toUChars(conv, convDest, 2*input.length(), input.c_str(), input.length(), &status);
This doesn't work. The result contains '?' characters for anything I put in that was above ASCII. The status reports no error. What am I doing wrong?
On top of that, I was having trouble compiling version 4.4 of the library, as the MSVC 9 project would not convert to an MSVC 10 project.
I am also aware of the open-source libiconv library, but I couldn't compile that one on Windows. If anyone has any advice on a different library, that's also welcome.
Thanks.
EDIT
The escape sequence I originally used was wrong. So now ICU takes the string, strips out the escape sequence - which is a step in the right direction. But the result still contains '?' chars.
EDIT2
The reason I couldn't convert to an MSVC 10 project was that the x64 platform wasn't installed (it isn't by default). Alternatively, I could open all the projects in a text editor and remove every mention of the x64 target.
Comments (3)
This doesn't resemble an ISO 2022 encoding: the high bits are supposed to be zero. The escape sequence looks somewhat recognizable, but an escape sequence starts with ESC, which is 0x1b, not 0xb0. I have no idea what those byte values really mean.
(This question looks familiar, Hi again.)
A minor, minor nit: You want to check the error status with
if(U_FAILURE(status))
(or conversely, U_SUCCESS(status)).
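The difference between U_FAILURE(status) and a plain comparison against U_ZERO_ERROR matters because ICU reports warnings as negative codes and errors as positive ones, so a warning is nonzero without being a failure. The sketch below illustrates that with local stand-in names (SketchErrorCode, sketchFailure and strictNotZero are hypothetical and not part of ICU) so that it compiles without ICU installed; the numeric values are meant to mirror unicode/utypes.h, and real code should include that header and use UErrorCode and U_FAILURE directly.

```cpp
#include <cstdint>

// Local stand-ins (assumption: values mirror ICU's unicode/utypes.h).
// Real code should use UErrorCode and U_FAILURE from ICU instead.
enum SketchErrorCode : std::int32_t {
    SKETCH_STRING_NOT_TERMINATED_WARNING = -124,  // warning: result is still usable
    SKETCH_ZERO_ERROR = 0,                        // plain success
    SKETCH_BUFFER_OVERFLOW_ERROR = 15             // genuine failure
};

// Same idea as U_FAILURE: only codes greater than zero count as failures.
constexpr bool sketchFailure(SketchErrorCode code) {
    return code > SKETCH_ZERO_ERROR;
}

// The comparison used in the question: it rejects warnings as well as errors.
constexpr bool strictNotZero(SketchErrorCode code) {
    return code != SKETCH_ZERO_ERROR;
}
```

With these definitions a warning passes the sketchFailure check but trips strictNotZero, which is why the macro is the safer habit even though it does not explain the '?' output in the question.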
I couldn't get the conversion to work for the JIS_X201 character set in ISO-2022-JP encoding. I also couldn't generate a "valid" encoded string using any tools at my disposal; I tried Java (both the ICU and non-ICU implementations of ISO-2022) and C++.
So I basically just wrote a function to do a code lookup and convert to Unicode using a table from Wikipedia.
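A hand-written lookup of this kind might look like the following. This is only a minimal sketch, not the function actually used: the names jisX0201ToUnicode and decodeJisX0201 are hypothetical, it assumes any escape sequences have already been stripped from the input, and it maps undefined bytes to U+FFFD as an arbitrary choice. The mapping itself follows the usual JIS X 0201 layout: the Roman half matches ASCII except for 0x5C (yen sign) and 0x7E (overline), and bytes 0xA1 through 0xDF are the halfwidth katakana at U+FF61 through U+FF9F.

```cpp
#include <string>

// Hypothetical helper (not part of ICU): decode a single JIS X 0201 byte.
char32_t jisX0201ToUnicode(unsigned char byte) {
    if (byte == 0x5C) return U'\u00A5';  // yen sign takes the place of backslash
    if (byte == 0x7E) return U'\u203E';  // overline takes the place of tilde
    if (byte <= 0x7F) return static_cast<char32_t>(byte);  // rest matches ASCII
    if (byte >= 0xA1 && byte <= 0xDF)    // halfwidth katakana, U+FF61..U+FF9F
        return static_cast<char32_t>(0xFF61 + (byte - 0xA1));
    return U'\uFFFD';                    // 0x80..0xA0 and 0xE0..0xFF are undefined
}

// Decode a whole byte string; assumes escape sequences were removed beforehand.
std::u32string decodeJisX0201(const std::string& bytes) {
    std::u32string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(jisX0201ToUnicode(b));
    }
    return out;
}
```

For the payload in the question, decodeJisX0201("ABC\xA6\xA7") yields 'A', 'B', 'C' followed by U+FF66 and U+FF67.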
EDIT
As I started filling out the bug report, I wanted to include the RFC for ISO-2022-JP. Then I found this line in the RFC: "The Kana set of JIS X 0201 is not used in ISO-2022-JP messages." So it appears that the standard doesn't actually define the upper (high-bit) byte values. ISO-2022-JP-3 will map those characters, but in the lower plane. So I would have to take each high byte, subtract 0x80 from it, and pass it through the ISO-2022-JP-3 converter, then take the other bytes (below 128) and pass them through the ISO-2022-JP converter, to get the full JIS_X201 character set. It's a lot easier to just do it myself.
So strictly speaking I would say it's not a bug. It's a huge headache though.
P.S. The whole messed-up stream that I'm trying to decode comes from DICOM. See page 107 of the PDF for what they consider acceptable.