Windows C API for UTF8 to 1252

Posted 2024-08-21 21:55:19

I'm familiar with WideCharToMultiByte and MultiByteToWideChar conversions and could use these to do something like:

UTF8 -> UTF16 -> 1252

I know that iconv will do what I need, but does anybody know of any MS libs that will allow this in a single call?

I should probably just pull in the iconv library, but am feeling lazy.

Thanks
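
For reference, the iconv route mentioned above could look roughly like the sketch below. The helper name utf8_to_cp1252_iconv is made up for illustration, and the encoding label "WINDOWS-1252" is an assumption: GNU libc and GNU libiconv accept it, but other iconv builds may only know the code page as "CP1252".

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Convert src_len bytes of UTF-8 into Windows-1252 with a single
 * iconv() call. Returns the number of bytes written to dst, or
 * (size_t)-1 on failure (for example an unsupported encoding name,
 * invalid input, or a destination buffer that is too small).
 * Hypothetical helper, shown only to illustrate the iconv calls.
 */
size_t
utf8_to_cp1252_iconv(const char *src, size_t src_len,
    char *dst, size_t dst_max_len)
{
    iconv_t cd = iconv_open("WINDOWS-1252", "UTF-8"); /* to, from */
    char *in = (char *)src;  /* iconv wants char **, input is not modified */
    char *out = dst;
    size_t in_left = src_len;
    size_t out_left = dst_max_len;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return dst_max_len - out_left;
}
```

How characters without a Windows-1252 equivalent are handled depends on the iconv implementation, so the sketch does not try to paper over that difference.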

Comments (1)

像极了他 2024-08-28 21:55:19

Windows-1252 is mostly equivalent to latin-1, aka ISO-8859-1. The two differ only in the range 128-159 (0x80-0x9F): latin-1 reserves it for control characters, while Windows-1252 assigns additional printable characters there, such as the euro sign and the curly quotes. If you are ready to ignore those extra characters and stick to latin-1, then conversion is rather easy. Try this:

#include <stddef.h>

/*  
 * Convert from UTF-8 to latin-1. Invalid encodings, and encodings of
 * code points beyond 255, are replaced by question marks. No more than
 * dst_max_len bytes are stored in the destination array. Returned value
 * is the length that the latin-1 string would have had, assuming a big
 * enough destination buffer.
 */
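size_t
utf8_to_latin1(const char *src, size_t src_len,
    char *dst, size_t dst_max_len)
{
    const unsigned char *sb;
    size_t u, v;

    u = v = 0;
    sb = (const unsigned char *)src;
    while (u < src_len) {
        int c = sb[u ++];
        if (c >= 0x80) {
            if (c >= 0xC2 && c < 0xE0) {
                /* Lead byte of a two-byte sequence (U+0080..U+07FF). */
                if (u < src_len && (sb[u] & 0xC0) == 0x80) {
                    c = ((c & 0x1F) << 6) + (sb[u ++] & 0x3F);
                    if (c > 0xFF)
                        c = '?';    /* not representable in latin-1 */
                } else {
                    c = '?';        /* missing continuation byte */
                }
            } else {
                /*
                 * Stray continuation byte, overlong lead byte (0xC0,
                 * 0xC1), longer sequence, or invalid byte. Emit a
                 * single '?' and skip the continuation bytes that the
                 * lead byte announces, as long as they are present.
                 */
                int n = (c >= 0xF0 && c < 0xF8) ? 3
                    : (c >= 0xE0 && c < 0xF0) ? 2
                    : (c == 0xC0 || c == 0xC1) ? 1 : 0;

                while (n -- > 0 && u < src_len
                    && (sb[u] & 0xC0) == 0x80)
                    u ++;
                c = '?';
            }
        }
        if (v < dst_max_len)
            dst[v] = (char)c;
        v ++;
    }
    return v;
}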

/*  
 * Convert from latin-1 to UTF-8. No more than dst_max_len bytes are
 * stored in the destination array. Returned value is the length that
 * the UTF-8 string would have had, assuming a big enough destination
 * buffer.
 */
size_t
latin1_to_utf8(const char *src, size_t src_len,
    char *dst, size_t dst_max_len)
{
    const unsigned char *sb;
    size_t u, v;

    u = v = 0;
    sb = (const unsigned char *)src;
    while (u < src_len) {
        int c = sb[u ++];
        if (c < 0x80) {
            if (v < dst_max_len)
                dst[v] = (char)c;
            v ++;
        } else {
            int h = 0xC0 + (c >> 6);
            int l = 0x80 + (c & 0x3F);
            if (v < dst_max_len) {
                dst[v] = (char)h;
                if ((v + 1) < dst_max_len)
                    dst[v + 1] = (char)l;
            }   
            v += 2;
        }   
    }   
    return v;
}   

Note that I make no guarantee about this code. This is completely untested.
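
If the characters in the 128-159 range do matter, one option is a small lookup table layered on top of a UTF-8 decoder. The sketch below maps a single Unicode code point to its Windows-1252 byte. The names cp1252_high and cp1252_from_codepoint are hypothetical, and the decoder feeding it would have to handle three-byte UTF-8 sequences, since many of these characters (the euro sign, for instance) lie above U+07FF.

```c
#include <stddef.h>

/*
 * Unicode code points assigned to Windows-1252 bytes 0x80..0x9F.
 * A zero entry marks a byte that Windows-1252 leaves undefined
 * (0x81, 0x8D, 0x8F, 0x90, 0x9D).
 */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/*
 * Map one Unicode code point to its Windows-1252 byte.
 * Returns the byte value (0..255), or -1 if the code point has no
 * Windows-1252 encoding. Hypothetical helper, not a library function.
 */
int
cp1252_from_codepoint(unsigned long cp)
{
    size_t i;

    /* Outside 0x80..0x9F, Windows-1252 and latin-1 are identical. */
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
        return (int)cp;
    for (i = 0; i < 32; i ++)
        if (cp1252_high[i] != 0 && cp1252_high[i] == cp)
            return 0x80 + (int)i;
    /* No mapping; this includes the C1 controls U+0080..U+009F. */
    return -1;
}
```

A caller would write the returned byte when it is non-negative and fall back to '?' otherwise, which matches the replacement policy of utf8_to_latin1 above.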
