UTF-8 to Unicode conversion

Posted 2024-10-12 09:38:54


I am having problems with converting UTF-8 to Unicode.

Below is the code:

int charset_convert( char * string, char * to_string,char* charset_from, char* charset_to)
{
    char *from_buf, *to_buf, *pointer;
    size_t inbytesleft, outbytesleft, ret;
    size_t TotalLen;
    iconv_t cd;

    if (!charset_from || !charset_to || !string) /* sanity check */
        return -1;

    if (strlen(string) < 1)
        return 0; /* we are done, nothing to convert */

    cd = iconv_open(charset_to, charset_from);
    /* Did I succeed in getting a conversion descriptor ? */
    if (cd == (iconv_t)(-1)) {
        /* I guess not */
        printf("Failed to convert string from %s to %s ",
              charset_from, charset_to);
        return -1;
    }
    from_buf = string;
    inbytesleft = strlen(string);
    /* allocate max sized buffer, 
       assuming target encoding may be 4 byte unicode */
    outbytesleft = inbytesleft *4 ;
    pointer = to_buf = (char *)malloc(outbytesleft);
    memset(to_buf,0,outbytesleft);
    memset(pointer,0,outbytesleft);

    ret = iconv(cd, &from_buf, &inbytesleft, &pointer, &outbytesleft);
    memcpy(to_string,to_buf,(pointer-to_buf));
}

main():

int main()
{    
    char  UTF []= {'A', 'B'};
    char  Unicode[1024]= {0};
    char* ptr;
    int x=0;
    iconv_t cd;

    charset_convert(UTF,Unicode,"UTF-8","UNICODE");

    ptr = Unicode;

    while(*ptr != '\0')
    {   
        printf("Unicode %x \n",*ptr);
        ptr++;
    }
    return 0;
}

It should give A and B, but I am getting:

ffffffff
fffffffe
41

Thanks,
Sandeep


泪意 2024-10-19 09:38:54


It looks like you are getting UTF-16 out in a little endian format:

ff fe 41 00 ...

That is U+FEFF (ZWNBSP, also known as the byte order mark), U+0041 (Latin capital letter A), ...

You then stop printing because your while loop has terminated on the first null byte. The following bytes should be: 42 00.

You should either return a length from your function or make sure that the output is terminated with a null character (U+0000) and loop until you find this.
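The second option above (loop until U+0000) could look roughly like the following. This is only a sketch: `print_utf16le` is a hypothetical helper name, and it assumes the buffer holds little-endian 16-bit code units, as in the output shown in the question. It reads two bytes at a time through `unsigned char`, so a single zero byte inside a code unit such as `41 00` no longer ends the loop, and bytes like `0xff` are not sign-extended to `ffffffff`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: print UTF-16LE code units until the U+0000
   terminator (two zero bytes) or until max_bytes is exhausted.
   Returns the number of code units printed. */
size_t print_utf16le(const unsigned char *buf, size_t max_bytes)
{
    size_t units = 0;

    assert(buf != NULL);
    for (size_t i = 0; i + 1 < max_bytes; i += 2) {
        /* Little endian: the low byte comes first. */
        unsigned int unit = (unsigned int)buf[i]
                          | ((unsigned int)buf[i + 1] << 8);
        if (unit == 0)
            break;              /* U+0000 terminator reached */
        printf("0x%04X\n", unit);
        units++;
    }
    return units;
}
```

Called on the `Unicode` buffer from the question (cast to `const unsigned char *`, with `sizeof Unicode` as the limit), this would walk past the BOM and both letters, because the buffer was zero-initialised and therefore ends in a U+0000 code unit.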


仲春光 2024-10-19 09:38:54

UTF-8 is Unicode.

You do not need to convert unless you need some other type of Unicode encoding, such as UTF-16 or UTF-32.


昇り龍 2024-10-19 09:38:54

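UTF is not Unicode. UTF is an encoding of the integers defined in the Unicode standard. The question, as asked, makes no sense. If you mean that you want to convert from (any) UTF to Unicode code points (i.e. the integers that stand for assigned code points, roughly characters), then you need to do some reading, but it involves bit-shifting the values of the 1, 2, 3 or 4 bytes in a UTF-8 byte sequence (see Wikipedia; Markus Kuhn's text is also excellent).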
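As an illustration of the bit-shifting mentioned above, here is a minimal sketch of decoding one UTF-8 sequence into a code point. The function name `utf8_decode` and its return convention are assumptions made for this example, not part of any library. The lead byte determines the sequence length, and each continuation byte contributes its low six bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with len
   bytes available. On success stores the code point in *cp and returns the
   number of bytes consumed (1 to 4). Returns 0 for truncated or malformed
   input, overlong forms, surrogates, and values above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    uint32_t c, min;
    size_t n;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    c = s[0];
    if (c < 0x80) {                    /* 0xxxxxxx: ASCII */
        *cp = c;
        return 1;
    } else if ((c & 0xE0) == 0xC0) {   /* 110xxxxx: 2 bytes */
        n = 2; c &= 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {   /* 1110xxxx: 3 bytes */
        n = 3; c &= 0x0F; min = 0x800;
    } else if ((c & 0xF8) == 0xF0) {   /* 11110xxx: 4 bytes */
        n = 4; c &= 0x07; min = 0x10000;
    } else {
        return 0;                      /* stray continuation or invalid lead */
    }

    if (len < n)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)     /* must be 10xxxxxx */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);  /* append six payload bits */
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return n;
}
```

For the question's input, each of `'A'` and `'B'` is a one-byte sequence, so the decoded code points are simply 0x41 and 0x42.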


梦里梦着梦中梦 2024-10-19 09:38:54


Unless I am missing something (nobody has pointed it out yet), "UNICODE" isn't a valid encoding name in libiconv, as it is the name of a family of encodings.

http://www.gnu.org/software/libiconv/

(edit) Actually, iconv -l shows UNICODE as a listed entry but gives no details. In the source code it is listed in the notes as an alias for UNICODE-LITTLE, but the subnotes mention:

 * UNICODE (big endian), UNICODEFEFF (little endian)
   We DON'T implement these because they are stupid and not standardized.

In the aliases header files UNICODELITTLE (no hyphen) resolves as follows:

lib/aliases.gperf:UNICODELITTLE, ei_ucs2le

i.e. UCS2-LE (UTF-16 Little Endian), which should match Windows internal "Unicode" encoding.

http://en.wikipedia.org/wiki/UTF-16/UCS-2

However, it is clearly recommended that you explicitly specify UCS2-LE or UCS2-BE, unless the first bytes are a byte order mark (BOM, value 0xFEFF) that indicates the byte order scheme.

=> You are seeing the BOM as the first bytes of the output because that is what the "UNICODE" encoding name means: UCS2 with a header indicating the byte order scheme.
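To illustrate naming the byte order explicitly, here is a hedged sketch of a conversion wrapper. The name `utf8_to_utf16le` is hypothetical, and the encoding name "UTF-16LE" is an assumption on my part: it is accepted by glibc and GNU libiconv, but the set of accepted names varies between iconv implementations. With an explicit byte order, no BOM is expected at the start of the output, and the function returns the number of bytes written so the caller does not have to search for a terminator.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: convert in_len bytes of UTF-8 into UTF-16LE.
   Returns the number of bytes written to out, or (size_t)-1 on failure
   (unknown encoding name, invalid input, or output buffer too small). */
size_t utf8_to_utf16le(const char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    iconv_t cd;
    char *src, *dst;
    size_t in_left, out_left, rc;

    assert(in != NULL && out != NULL);

    /* Note the argument order: target encoding first, source second. */
    cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    src = (char *)in;   /* iconv() takes char **; the input is not modified */
    dst = out;
    in_left = in_len;
    out_left = out_cap;

    rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    return out_cap - out_left;
}
```

On systems where iconv is not part of the C library, linking may additionally require `-liconv`.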

