How to convert ISO-8859-1 to UTF-8 using libiconv in C++
I'm using libcurl to fetch some HTML pages.
The HTML pages contain some numeric character references like: &#1505;&#1500;&#1511;&#1493;&#1501; (which renders as סלקום)
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
Is this the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word?
Thanks
EDIT: I found the solution. MSalters was right: libxml2 does use UTF-8.
I added this to eclipse.ini
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console.
Thanks
Comments (3)
Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that you got that? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.
To answer your last question, an xmlChar* is already UTF-8 and needs no further conversion.
No. Those entities correspond to the decimal value of the Unicode code point of each character. See this page for example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte characters. See the UTF-8 specification for this.
This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS-2. Specify this as the input encoding for iconv. There might also be an endianness issue; have a look here
The C-style way would be (no error checking, for clarity):