PHP: SimpleXML with different code pages, and getting the data out correctly
I am working on a project where I receive XML files from different sources. My PHP script should read them, parse them, and store them in a MySQL database.
To parse the XML files, I use the SimpleXMLElement class in PHP. I receive files from Belgium in UTF-8 encoding, from Germany in iso-8859-1 encoding, from the Czech Republic in cp1250, and so on...
When I pass the XML data to SimpleXMLElement and print the result of asXML() on this object, I see the XML data correctly, as it was in the original XML file.
When I assign a field to a PHP variable and print this variable on the screen, the text looks corrupted, and it is of course also corrupted when inserted into the MySQL database.
Example:
The XML:
<?xml version="1.0" encoding="cp1250"?>
...
<name>Labe Dìèín - Rozb 741,85km ; Dìèín - Rozb 741,85km </name>
...
The PHP code:
$sxml = file_get_contents("test.xml");
$xml = new SimpleXMLElement($sxml);
//echo $xml->asXML() . "\n"; // content will show up correctly in the shell
$name = (string)$xml->ftm->fairway_section->geo_object->name;
echo $name . "\n";
The result of the code (in a Linux bash shell) moves the cursor upwards and then prints: bÃn - Rozb 741,85km ; DÄ (the cursor movement is of course related to the incorrect characters that PHP prints out)
I think that PHP converts its data to UTF-8 to store it in a string parameter, so I presumed that using mb_convert_encoding to convert from UTF-8 to cp1250 would show the correct result, but it doesn't. Also I should be able to store the data in a format that is combinable with all the other sources.
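One way to test the assumption that the string is UTF-8 is to look at its raw bytes instead of echoing it to the terminal. The following is a minimal sketch, not part of the original script: the inline document is a hypothetical stand-in for test.xml, its root and name elements are made up for illustration, and it assumes the mbstring extension is loaded and that libxml can decode the declared cp1250 encoding.

```php
<?php
// Hypothetical stand-in for test.xml: "\xEC\xE8\xED" are the cp1250 bytes for "ěčí".
$sxml = "<?xml version=\"1.0\" encoding=\"cp1250\"?>"
      . "<root><name>D\xEC\xE8\xEDn</name></root>";

$xml  = new SimpleXMLElement($sxml);
$name = (string) $xml->name;

// Dump the bytes as hex so that the terminal cannot distort them.
$hex = bin2hex($name);

// True when the bytes form well-formed UTF-8.
$isUtf8 = mb_check_encoding($name, 'UTF-8');

echo $hex, "\n";
var_dump($isUtf8);
```

If the hex dump shows two-byte sequences such as c49b for ě, the string itself is valid UTF-8 and the corruption would be happening at display time, for example in a terminal that is not set to UTF-8.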
I don't know much about encodings and code pages, which is probably why I can't get this to work. What I do know is that if I copy and paste the texts from the different languages into, for example, a new UltraEdit file, all of them show up correctly. How does UltraEdit handle this? Does it use UTF-8 (which I presume can represent anything)?
How can I convert my data so that it will always show up, with whatever encoding on the source?
Comments (2)
Try iconv instead:
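A minimal sketch of what such an iconv call might look like, under stated assumptions: the value read from SimpleXML is taken to be UTF-8, and the output device is taken to expect CP1250. Both the variable names and the choice of target encoding are illustrative and do not come from the original post.

```php
<?php
// Stand-in for the value read from SimpleXML: the UTF-8 bytes of "Děčín".
$name = "D\xC4\x9B\xC4\x8D\xC3\xADn";

// Convert to the encoding the output device expects (CP1250 is an assumption here).
$out = iconv('UTF-8', 'CP1250', $name);

if ($out === false) {
    // iconv() returns false when the input is invalid or a character cannot be converted.
    echo "conversion failed\n";
} else {
    // Print the bytes as hex so the result can be inspected on any terminal.
    echo bin2hex($out), "\n";
}
```

If the goal is a single storage format for all sources, it may be simpler to keep the UTF-8 string as it is and convert only at the point of output.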
The problem is that your input file is malformed. There is no character ì (LATIN SMALL LETTER I WITH GRAVE) in Windows-1250. The closest character is U+00ED (LATIN SMALL LETTER I WITH ACUTE).
The fact that such a character displays correctly in the shell is likely fortuitous.
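To check which character a given byte stands for under each code page, one byte can be decoded both ways. This is only a sketch, assuming the iconv extension is available; the byte 0xEC is picked because it is the ISO-8859-1 code for ì, and whether the original file actually contains that byte is an assumption.

```php
<?php
// A single byte to examine; 0xEC is an illustrative choice, not taken from the file.
$byte = "\xEC";

// Decode the same byte under two different code pages.
$asCp1250 = iconv('CP1250', 'UTF-8', $byte);      // U+011B, e with caron
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);  // U+00EC, i with grave

// Show the resulting UTF-8 bytes as hex.
echo bin2hex($asCp1250), ' ', bin2hex($asLatin1), "\n";
```

The same byte yields different characters depending on the code page used to read it, so what a viewer shows for a file depends on the encoding that viewer assumes.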