PHP: SimpleXML, different code pages, and getting the data out correctly

Posted 2024-10-16 23:47:54

I am working on a project where I receive XML files from several sources. My PHP script should read them, parse them, and store the data in a MySQL database.

To parse the XML files, I use the SimpleXMLElement class in PHP. I receive files from Belgium in UTF-8, from Germany in ISO-8859-1, from the Czech Republic in cp1250, and so on.

When I pass the XML data to SimpleXMLElement and print the result of asXML() on that object, I see the XML data correctly, exactly as it was in the original XML file.
When I assign a field to a PHP variable and print that variable on the screen, the text looks corrupted, and it is of course also corrupted when inserted into the MySQL database.

Example:

The XML:

<?xml version="1.0" encoding="cp1250"?>
...
<name>Labe Dìèín - Rozb 741,85km  ;  Dìèín - Rozb 741,85km </name>
...

The PHP code:

$sxml = file_get_contents("test.xml");
$xml = new SimpleXMLElement($sxml);
//echo $xml->asXML() . "\n"; // content will show up correctly in the shell
$name = (string)$xml->ftm->fairway_section->geo_object->name;
echo $name . "\n";

The result (in a Linux bash shell): the cursor moves upwards and then this is printed: bín - Rozb 741,85km ; DÄ (the cursor movement is presumably caused by the incorrect characters PHP prints).

I think PHP converts the data to UTF-8 when it stores it in a string, so I assumed that using mb_convert_encoding to convert from UTF-8 to cp1250 would give the correct result, but it doesn't. In any case, I need to store the data in a format that can be combined with all the other sources.

I don't know much about encodings and code pages, which is probably why I can't get this to work. What I do know is that if I copy and paste the texts in the different languages into, for example, a new UltraEdit file, all of them display correctly. How does UltraEdit handle this? Does it use UTF-8 (which I presume can represent anything)?

How can I convert my data so that it always displays correctly, whatever encoding the source uses?
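
A minimal sketch of what seems to be going on, assuming SimpleXML returns string values as UTF-8 regardless of the encoding declared in the XML prolog, and assuming the underlying libxml2 build can decode cp1250 (typically through iconv). The sample document is built from raw cp1250 bytes so the example does not depend on how the script file itself is saved. Only the element name `name` comes from the question; the `root` wrapper and variable names are illustrative.

```php
<?php
// "Děčín" as cp1250 bytes: ě = 0xEC, č = 0xE8, í = 0xED.
$cp1250Xml = "<?xml version=\"1.0\" encoding=\"cp1250\"?>\n"
           . "<root><name>D\xEC\xE8\xEDn</name></root>";

$xml  = new SimpleXMLElement($cp1250Xml);
$name = (string) $xml->name;

// The extracted value is UTF-8, whatever the source encoding was.
echo bin2hex($name), "\n";                 // 44c49bc48dc3ad6e

// The /u modifier makes PCRE reject invalid UTF-8, so a match
// count of 1 means $name is a valid UTF-8 string.
var_dump(preg_match('//u', $name) === 1);  // bool(true)

// asXML() serialises using the document's declared encoding, which
// is likely why its output looks like the original cp1250 file.
echo bin2hex($xml->name->asXML()), "\n";
```

If that assumption holds, the value can be stored as-is in a UTF-8 database column and no per-source conversion is needed. The garbled shell output would then point to a terminal that is not set to UTF-8: the UTF-8 encoding of ě is the byte pair 0xC4 0x9B, and in an ISO-8859-1 style terminal 0xC4 displays as Ä while 0x9B is a control character (CSI), which would fit the cursor movement described above.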

遮云壑 2024-10-23 23:47:54

Try iconv instead:

$str = iconv('UTF-8', 'WINDOWS-1250', $str);
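
The call above can be wrapped so that a failed conversion is not silently ignored. This is a minimal sketch, assuming the iconv extension is available and that the input is the UTF-8 string returned by SimpleXML; the helper name `toTerminalEncoding` is hypothetical. A conversion like this is probably only needed for display on a non-UTF-8 terminal, since the UTF-8 string itself can hold text from all sources.

```php
<?php
/**
 * Convert a UTF-8 string to a legacy encoding for display.
 * Hypothetical helper, not part of PHP.
 */
function toTerminalEncoding(string $utf8, string $target = 'WINDOWS-1250'): string
{
    // iconv() returns false (and raises a notice) when the input
    // cannot be converted to the target encoding.
    $converted = @iconv('UTF-8', $target, $utf8);
    if ($converted === false) {
        throw new RuntimeException("Cannot represent string in $target");
    }
    return $converted;
}

$utf8Name = "D\xC4\x9B\xC4\x8D\xC3\xADn";   // "Děčín" as UTF-8 bytes
echo bin2hex(toTerminalEncoding($utf8Name)), "\n";   // 44ece8ed6e
```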

不再见 2024-10-23 23:47:54

The problem is that your input file is malformed: there is no character ì (LATIN SMALL LETTER I WITH GRAVE) in Windows-1250.

The closest character is U+00ED (LATIN SMALL LETTER I WITH ACUTE).

The fact that such a character shows up correctly in the shell is likely a coincidence.
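
The claim about ì can be checked at the byte level. This is a small sketch, assuming a glibc-style iconv (other iconv implementations may substitute a character instead of failing); the variable names are illustrative. It shows that the byte 0xEC, which ISO-8859-1 displays as ì, decodes to ě in Windows-1250, and that ì itself has no Windows-1250 encoding. One possible reading, which the question does not confirm, is that the sample is Windows-1250 text shown through an ISO-8859-1 viewer.

```php
<?php
$byte = "\xEC";

// The same byte decodes to different characters in the two code pages.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);    // ì (U+00EC)
$asCp1250 = iconv('WINDOWS-1250', 'UTF-8', $byte);  // ě (U+011B)

echo bin2hex($asLatin1), "\n";   // c3ac
echo bin2hex($asCp1250), "\n";   // c49b

// Going the other way, ì (UTF-8 bytes C3 AC) has no slot in
// Windows-1250; with glibc iconv the call is expected to fail.
$back = @iconv('UTF-8', 'WINDOWS-1250', "\xC3\xAC");
var_dump($back);
```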
