将混合编码的xml文件数据保存到mysql数据库中的utf-8
我有一个混合编码的 xml 文件(虽然文件以 iso-8859-1 编码表示),但也包含来自 windows 1252 的字符(商标符号、endash 等)
我使用 PHP 和 xmlreader 解析 xml 文件以保存在数据库中。 MySQL 5.0 服务器将混合编码字符保存为方框字符,但 MySQL 5.1 给出错误。
那么问题来了,正确保存utf-8数据最简单、最充分的证明方法是什么。
这是我当前将其转换为 utf-8 的代码,只是想知道转换时是否会产生问题?
function cp1252_to_utf8($str)
{
$cp1252_map = array(
"\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
"\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
"\xc2\x83" => "\xc6\x92", /* LATIN SMALL LETTER F WITH HOOK */
"\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
"\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
"\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
"\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
"\xc2\x88" => "\xcb\x86", /* MODIFIER LETTER CIRCUMFLEX ACCENT */
"\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
"\xc2\x8a" => "\xc5\xa0", /* LATIN CAPITAL LETTER S WITH CARON */
"\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
"\xc2\x8c" => "\xc5\x92", /* LATIN CAPITAL LIGATURE OE */
"\xc2\x8e" => "\xc5\xbd", /* LATIN CAPITAL LETTER Z WITH CARON */
"\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
"\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
"\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
"\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
"\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
"\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
"\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
"\xc2\x98" => "\xcb\x9c", /* SMALL TILDE */
"\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
"\xc2\x9a" => "\xc5\xa1", /* LATIN SMALL LETTER S WITH CARON */
"\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
"\xc2\x9c" => "\xc5\x93", /* LATIN SMALL LIGATURE OE */
"\xc2\x9e" => "\xc5\xbe", /* LATIN SMALL LETTER Z WITH CARON */
"\xc2\x9f" => "\xc5\xb8" /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
);
return strtr(utf8_encode($str), $cp1252_map);
}
$sql='SET NAMES "utf8" COLLATE "utf8_swedish_ci"';
mysql_query($sql);
$arr_book["booktitle"] = cp1252_to_utf8( iconv("UTF-8", "ISO-8859-1//TRANSLIT", $arr_book["
booktitle"]));
I have an xml file with mixed encoding (file is to be said in iso-8859-1 encoding though) but contain characters from windows 1252 also (trademark symbol, endash etc)
Im using PHP and xmlreader to parse xml file to save in database. MySQL 5.0 server is saving the mixed encoded characters as box character but MySQL 5.1 gives error.
so the question is, what is the easiest and full proof method to correctly save the utf-8 data.
This is my current code to convert it to utf-8, just wanted to know, if it may create problem while converting?
function cp1252_to_utf8($str)
{
$cp1252_map = array(
"\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
"\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
"\xc2\x83" => "\xc6\x92", /* LATIN SMALL LETTER F WITH HOOK */
"\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
"\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
"\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
"\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
"\xc2\x88" => "\xcb\x86", /* MODIFIER LETTER CIRCUMFLEX ACCENT */
"\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
"\xc2\x8a" => "\xc5\xa0", /* LATIN CAPITAL LETTER S WITH CARON */
"\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
"\xc2\x8c" => "\xc5\x92", /* LATIN CAPITAL LIGATURE OE */
"\xc2\x8e" => "\xc5\xbd", /* LATIN CAPITAL LETTER Z WITH CARON */
"\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
"\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
"\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
"\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
"\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
"\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
"\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
"\xc2\x98" => "\xcb\x9c", /* SMALL TILDE */
"\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
"\xc2\x9a" => "\xc5\xa1", /* LATIN SMALL LETTER S WITH CARON */
"\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
"\xc2\x9c" => "\xc5\x93", /* LATIN SMALL LIGATURE OE */
"\xc2\x9e" => "\xc5\xbe", /* LATIN SMALL LETTER Z WITH CARON */
"\xc2\x9f" => "\xc5\xb8" /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
);
return strtr(utf8_encode($str), $cp1252_map);
}
$sql='SET NAMES "utf8" COLLATE "utf8_swedish_ci"';
mysql_query($sql);
$arr_book["booktitle"] = cp1252_to_utf8( iconv("UTF-8", "ISO-8859-1//TRANSLIT", $arr_book["
booktitle"]));
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
如果同一列中有混合编码,则只有 1 个合理的选择:存储为二进制,而不是存储在特殊的字符集中。如果文件位于 cp1252 中(与 ISO-8859-1 大部分重叠,那么您可能可以将 cp1252 声明为输入) ),只需在加载为 XML 之前调用 iconv 函数即可。 (
$utf8string = iconv('cp1252','utf-8',$string);
)If you have mixed encodings in the same column, you have only 1 reasonable option: store as binary, rather then in a special charset. If the file is in
cp1252
though (which overlaps for a huge part withISO-8859-1
so probably you can just claimcp1252
as input), just call theiconv
function on it before loading as XML. ($utf8string = iconv('cp1252','utf-8',$string);
)