将混合编码的xml文件数据保存到mysql数据库中的utf-8

发布于 2024-10-01 07:39:24 字数 2782 浏览 1 评论 0原文

我有一个混合编码的 xml 文件(虽然文件以 iso-8859-1 编码表示),但也包含来自 windows 1252 的字符(商标符号、endash 等)

我使用 PHP 和 xmlreader 解析 xml 文件以保存在数据库中。 MySQL 5.0 服务器将混合编码字符保存为方框字符,但 MySQL 5.1 给出错误。

那么问题来了,正确保存utf-8数据最简单、最充分的证明方法是什么。

这是我当前将其转换为 utf-8 的代码,只是想知道转换时是否会产生问题?

 function cp1252_to_utf8($str) 
    {   
       $cp1252_map = array(
                "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
                "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
                "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
                "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
                "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
                "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
                "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
                "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
                "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
                "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
                "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
                "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
                "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
                "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
                "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
                "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
                "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
                "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
                "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
                "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */

                "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
                "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
                "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
                "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
                "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
                "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
                "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
            );

            return  strtr(utf8_encode($str), $cp1252_map);
    }


    $sql='SET NAMES "utf8" COLLATE "utf8_swedish_ci"';
    mysql_query($sql);


    $arr_book["booktitle"] = cp1252_to_utf8( iconv("UTF-8", "ISO-8859-1//TRANSLIT", $arr_book["

booktitle"]));

I have an xml file with mixed encoding (file is to be said in iso-8859-1 encoding though) but contain characters from windows 1252 also (trademark symbol, endash etc)

Im using PHP and xmlreader to parse xml file to save in database. MySQL 5.0 server is saving the mixed encoded characters as box character but MySQL 5.1 gives error.

so the question is, what is the easiest and full proof method to correctly save the utf-8 data.

This is my current code to convert it to utf-8, just wanted to know, if it may create problem while converting?

 function cp1252_to_utf8($str) 
    {   
       $cp1252_map = array(
                "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
                "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
                "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
                "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
                "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
                "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
                "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
                "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
                "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
                "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
                "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
                "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
                "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
                "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
                "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
                "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
                "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
                "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
                "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
                "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */

                "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
                "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
                "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
                "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
                "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
                "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
                "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
            );

            return  strtr(utf8_encode($str), $cp1252_map);
    }


    $sql='SET NAMES "utf8" COLLATE "utf8_swedish_ci"';
    mysql_query($sql);


    $arr_book["booktitle"] = cp1252_to_utf8( iconv("UTF-8", "ISO-8859-1//TRANSLIT", $arr_book["

booktitle"]));

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

情痴 2024-10-08 07:39:24

如果同一列中有混合编码,则只有 1 个合理的选择:存储为二进制,而不是存储在特殊的字符集中。如果文件位于 cp1252 中(与 ISO-8859-1 大部分重叠,那么您可能可以将 cp1252 声明为输入) ),只需在加载为 XML 之前调用 iconv 函数即可。 ($utf8string = iconv('cp1252','utf-8',$string);)

If you have mixed encodings in the same column, you have only 1 reasonable option: store as binary, rather then in a special charset. If the file is in cp1252 though (which overlaps for a huge part with ISO-8859-1 so probably you can just claim cp1252 as input), just call the iconv function on it before loading as XML. ($utf8string = iconv('cp1252','utf-8',$string);)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文