PHP - Handling invalid XML

Posted on 2024-09-02 18:17:30

I'm using SimpleXML to load some XML files (which I didn't write or provide, and whose format I can't really change).

Occasionally (e.g. one or two files out of every 50 or so) they don't escape special characters (mostly &, but sometimes other random invalid things too). This creates an issue because SimpleXML in PHP simply fails, and I don't know of any good way to handle parsing invalid XML.

My first idea was to preprocess the XML as a string and wrap ALL fields in CDATA so it would work, but for some ungodly reason the XML I need to process puts all of its data in attributes, so the CDATA idea is out. An example of the XML:

 <Author v="By Someone & Someone" />

What's the best way to replace all the invalid characters in the XML before I load it with SimpleXML?
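
For the stray ampersand case specifically, one common preprocessing approach is a regular expression that escapes every & that does not already begin an entity or character reference. The following is only a minimal sketch under that assumption: the function name escape_bare_ampersands is made up for illustration, and it does nothing about the other kinds of invalid input mentioned above.

```php
<?php
/**
 * Escape ampersands that do not start an entity or character reference.
 * Hypothetical helper for illustration; it does not repair other problems
 * such as stray "<", control characters, or undeclared named entities.
 */
function escape_bare_ampersands(string $xml): string
{
    // Negative lookahead: leave "&name;", "&#123;" and "&#x1F;" untouched.
    $pattern = '/&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/';

    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '&amp;', $xml) ?? $xml;
}

$raw   = '<Author v="By Someone & Someone" />';
$fixed = escape_bare_ampersands($raw);
$sxe   = simplexml_load_string($fixed);

echo $fixed, PHP_EOL;
echo (string) $sxe['v'], PHP_EOL;
```

Because the pattern only touches ampersands, already-escaped input passes through unchanged, so it should be safe to run on the well-formed files as well.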

Comments (3)

别闹i 2024-09-09 18:17:30

What you need is something that will use libxml's internal errors to locate invalid characters and escape them accordingly. Here's a mockup of how I'd write it. Take a look at the result of libxml_get_errors() for error info.

function load_invalid_xml($xml)
{
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments

    $sxe = simplexml_load_string($xml);

    // Compare against false: an empty root element is a valid but falsy object
    if ($sxe !== false)
    {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }

    $fixed_xml = '';
    $last_pos  = 0;

    foreach (libxml_get_errors() as $error)
    {
        // $pos is the position of the faulty character,
        // you have to compute it yourself (compute_position() is a placeholder)
        $pos = compute_position($error->line, $error->column);
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($xml[$pos]);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}
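
The compute_position() placeholder above could be sketched roughly as follows. This is an assumption-laden illustration, not part of the original answer: it takes the XML string as an extra first argument, treats line and column as 1-based, and counts columns in bytes. Whether libxml reports bytes or characters, and whether the column points at the faulty character or just past it, is worth checking against real error output before relying on it.

```php
<?php
/**
 * Convert a 1-based (line, column) pair into a 0-based byte offset.
 * Assumes columns count bytes and that lines are terminated by "\n".
 * Returns null when the position falls outside the string.
 */
function compute_position(string $xml, int $line, int $column): ?int
{
    if ($line < 1 || $column < 1) {
        return null; // a column of 0 is treated as "position unknown"
    }

    // Walk forward to the first byte of the requested line.
    $offset = 0;
    for ($i = 1; $i < $line; $i++) {
        $newline = strpos($xml, "\n", $offset);
        if ($newline === false) {
            return null; // the string has fewer lines than requested
        }
        $offset = $newline + 1;
    }

    $pos = $offset + $column - 1;

    return $pos < strlen($xml) ? $pos : null;
}

$xml = "<Root>\n <Author v=\"By Someone & Someone\" />\n</Root>";
$pos = compute_position($xml, 2, 24);
echo $xml[$pos], PHP_EOL;
```

Working on byte offsets keeps the result directly usable with substr() and $xml[$pos], which is what the mockup relies on.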
梦晓ヶ微光ヅ倾城 2024-09-09 18:17:30

I think a workaround for writing the compute_position function is to flatten the XML string before processing.
Here is a rewrite of the code posted by Josh:

function load_invalid_xml($xml)
{
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments

    $sxe = simplexml_load_string($xml);

    // Compare against false: an empty root element is a valid but falsy object
    if ($sxe !== false)
    {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }

    $fixed_xml = '';
    $last_pos  = 0;

    // make string flat
    $xml = str_replace(array("\r\n", "\r", "\n"), "", $xml);

    // get file encoding
    $encoding = mb_detect_encoding($xml);

    foreach (libxml_get_errors() as $error)
    {
        $pos = $error->column;
        $invalid_char = mb_substr($xml, $pos, 1, $encoding);
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($invalid_char);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}

I've added the encoding handling because I had problems with the plain array[index] way of getting a character from a string.

This should all work but, for reasons I don't understand, $error->column gives me a different number than it should. I tried to debug this by adding some invalid characters to the XML and checking what value it returns, but had no luck.
I hope someone can tell me what is wrong with this approach.

把昨日还给我 2024-09-09 18:17:30

Despite this problem being 10 years old (as I type this), I'm still running into similar XML parsing issues (PHP 8.1), which is why I ended up here. The answers already given are helpful, but either incomplete, inconsistent or otherwise unsuitable for my problem and, I suspect, for the original poster's too.

Inspecting internal XML parsing issues seems right, but there are 735 error codes (see https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-xmlerror.html), so a more adaptable solution seems appropriate.

I used the word "inconsistent" above because the best of the other answers (@Adam Szmyd) mixed multibyte string handling with non-multibyte string handling.

The following code uses Adam's as the base and I reworked it for my situation, which I feel could be extended further depending on the problems actually being experienced. So, I'm not complete either - sorry!

The essence of this code is that it handles "each" (in my implementation, just 1) kind of XML parsing error as a separate case. The error I was experiencing was an unrecognised HTML entity (&ccedil; - ç), so I use PHP entity replacement instead.

function load_invalid_xml($xml)
{
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments

    $sxe = simplexml_load_string($xml);

    // Compare against false: an empty root element is a valid but falsy object
    if ($sxe !== false) {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }

    $fixed_xml = '';
    $last_pos  = 0;

    // make string flat
    $xmlFlat = mb_ereg_replace( '(\r\n|\r|\n)', '', $xml );

    // Regenerate the error but using the flattened source so error offsets are directly relevant
    libxml_clear_errors();
    simplexml_load_string( $xmlFlat );

    foreach (libxml_get_errors() as $error)
    {
        $pos = $error->column - 1; // ->column appears to be 1 based, not 0 based

        switch( $error->code ) {

            case 26: // error undeclared entity
            case 27: // warning undeclared entity
                if ($pos >= 0) { // the PHP docs suggest this not always set (in which case ->column is == 0)

                    $left = mb_substr( $xmlFlat, 0, $pos );
                    $amp = mb_strrpos( $left, '&' );

                    if ($amp !== false) {

                        $entity = mb_substr( $left, $amp );
                        $fixed_xml .= mb_substr( $xmlFlat, $last_pos, $amp - $last_pos )
                            . html_entity_decode( $entity );
                        $last_pos = $pos;
                    }
                }
                break;

            default:
                break; // other error codes are not handled here
        }
    }
    // Offsets refer to the flattened string, so take the remainder from it too
    $fixed_xml .= mb_substr($xmlFlat, $last_pos);

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}
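
To decide which error codes are worth adding as further cases, it can help to dump what libxml actually reports for a failing file first. The helper below is a small sketch for that purpose; the name collect_xml_errors and the array layout are invented for illustration, and only the documented LibXMLError properties (level, code, line, column, message) are used.

```php
<?php
/**
 * Parse $xml and return the libxml errors as plain arrays for inspection.
 * Hypothetical helper for illustration only.
 */
function collect_xml_errors(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = [
            'level'   => $error->level,
            'code'    => $error->code,
            'line'    => $error->line,
            'column'  => $error->column,
            'message' => trim($error->message),
        ];
    }

    // Leave the libxml error state as we found it.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

foreach (collect_xml_errors('<Author v="By Someone & Someone" />') as $e) {
    printf("code %d at line %d, column %d: %s\n", $e['code'], $e['line'], $e['column'], $e['message']);
}
```

Running this over a sample of the problematic files should show which codes and columns really occur, which is a more reliable basis for new switch cases than guessing from the full list of codes.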