XMLReader -- 遇到 utf 字符问题

发布于 2024-09-27 21:46:31 字数 488 浏览 8 评论 0原文

我正在解析一个巨大的 xml 文件,并且需要说明文件的编码
<强>< ? xml version="1.0"encoding="ISO-8859-1" ?>**bold

数据库编码是 utf8,我在将任何内容保存到数据库之前运行此查询
$sql='设置名称“utf8”整理“utf8_swedish_ci”';

问题是有时 xml 文件中会出现一些非标准字符,例如
Lycka™:罗马
我知道商标符号来自windows-1252编码。

我正在使用 php。我试过utf8_encode。

这里保存在数据库 alt text 中,

这是浏览器中的输出 alt text

我想将其转换为utf,就是这样

I am parsing a huge xml file and encoding of file is to be said
< ? xml version="1.0" encoding="ISO-8859-1" ?>**bold

The db encoding is utf8 and I am running this query before anything is saved to db
$sql='SET NAMES "utf8" COLLATE "utf8_swedish_ci"';

What the problem is that sometimes some non standard characters comes in the xml file like
Lycka™ : roman
I know that trademark symbol is from windows-1252 encoding.

Im using php. I have tried utf8_encode.

here is saved in db alt text and

here is the output in browser alt text

I want it to converted to utf, that's it

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

溺渁∝ 2024-10-04 21:46:31

您是否尝试在保存到 db 之前将字符串编码为 utf8 ?
对于 php 有 utf8_encode() 函数,您使用的语言中可能有类似的函数。

Did you try encoding the string in utf8 before saving to db ?
For php there is utf8_encode() function, there might be similar functions in the language you are using.

病女 2024-10-04 21:46:31

我使用了这段代码并且工作得很好

function cp1252_to_utf8($str) 
{

        $cp1252_map = array(
                "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
                "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
                "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
                "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
                "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
                "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
                "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
                "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
                "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
                "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
                "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
                "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
                "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
                "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
                "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
                "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
                "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
                "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
                "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
                "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */

                "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
                "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
                "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
                "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
                "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
                "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
                "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
        );

        return  strtr(utf8_encode($str), $cp1252_map);
}


$str = cp1252_to_utf8( iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str));

I used this code and worked fine

function cp1252_to_utf8($str) 
{

        $cp1252_map = array(
                "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
                "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
                "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
                "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
                "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
                "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
                "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
                "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
                "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
                "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
                "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
                "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
                "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
                "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
                "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
                "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
                "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
                "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
                "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
                "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */

                "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
                "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
                "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
                "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
                "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
                "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
                "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
        );

        return  strtr(utf8_encode($str), $cp1252_map);
}


$str = cp1252_to_utf8( iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str));
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文