Recommended way to parse an invalid XML response with a namespace in this case

Posted on 2024-11-19 17:37:30

I am using PHP to parse the XML response of an API. Here is a sample response:

$xml = '<?xml version="1.0"?>
                    <q:response xmlns:q="http://api-url">
                        <q:impression>
                            <q:content>
                                <html>
                                        <meta name="HandheldFriendly" content="True">
                                        <meta name="viewport" content="width=device-width, user-scalable=no">
                                        <meta http-equiv="cleartype" content="on">
                                    </head>
                                    <body style="margin:0px;padding:0px;">
                                        <iframe scrolling="no" src="http://api-response-url/with/lots?of=parameters&somethingmore=someval" width="320px" height="50px" style="border:none;"></iframe>
                                    </body>
                                </html>
                            </q:content>
                            <q:cpc>0.02</q:cpc>
                        </q:impression>
                    </q:response>';

Note the following points:

The response contains some invalid markup:

The opening <head> tag inside <html> is missing, but the closing </head> tag is present.
The <meta> tags inside <html> are not closed.
The iframe's src attribute contains a URL with multiple parameters separated by &. So this URL, and any other URLs that may appear, need to be urlencoded before calling $dom->loadXML(); (see my code below).
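To make the three problems above visible, here is a minimal sketch that feeds an abbreviated copy of the sample response to DOMDocument::loadXML() and prints what libxml reports. The shortened $xml string is an assumption made for brevity; the exact error messages depend on the installed libxml version, so none are quoted here.

```php
<?php
// Abbreviated copy of the sample response (an assumption for brevity):
// missing <head> start tag, unclosed <meta>, and a raw & in the iframe URL.
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url"><q:impression><q:content><html>
<meta name="viewport" content="width=device-width, user-scalable=no">
</head>
<body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body>
</html></q:content><q:cpc>0.02</q:cpc></q:impression></q:response>';

// Collect libxml errors instead of letting PHP emit warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadXML($xml); // false when the document is not well-formed

$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Print each reported problem with the line it was found on.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Because the document is not well-formed XML, $loaded is false and no DOM tree is available to query.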

Requirement

I need to read whatever is inside the <q:content></q:content> tags.
I need to parse the invalid markup (as I receive it) and read the content properly.
URLs need to be encoded for the characters listed in "What characters do I need to escape in XML documents?". This needs to be done within the logic I am currently following.

Current code

So far, I have this code, which works fine if the content inside the <q:content></q:content> tags is valid markup:

$dom = new DOMDocument;

// Load the XML string defined above; this works only if the entire XML is well-formed.
$dom->loadXML($xml);

$adHtml = "";

// Visit every element in the API namespace and pick out <q:content>.
foreach ($dom->getElementsByTagNameNS('http://api-url', '*') as $element)
{
    if ($element->localName === "content")
    {
        // Serialize each child node of <q:content> back to markup.
        foreach ($element->childNodes as $child)
        {
            $adHtml .= $child->ownerDocument->saveXML($child);
        }
    }
}

echo $adHtml; // The required content ends up here.

Check the working code here (with valid markup and a single parameter in the iframe src).

What I am thinking now

Now, going by the solution given by @hakre in my previous question:

I tried DOMDocument::loadHTML() and, as I expected, it fails. It gives warnings such as: Warning: DOMDocument::loadHTML(): Tag q:response invalid in Entity, line: 2
The other suggestion is to escape a specific part of the string for the characters listed in "What characters do I need to escape in XML documents?".
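For illustration, escaping only a specific part of the string could look like the sketch below. It is not @hakre's exact code: the helper name escapeContentSection() is hypothetical, the regular expression assumes the namespace prefix is always the literal q and that <q:content> carries no attributes, and the shortened $xml is an assumption. Once the inner markup is escaped, the envelope is well-formed and the original markup comes back as the text content of <q:content>.

```php
<?php
// Hypothetical helper: escape everything between <q:content> and </q:content>
// so the broken HTML becomes plain character data inside a well-formed envelope.
function escapeContentSection(string $xml): string
{
    $escaped = preg_replace_callback(
        '~(<q:content>)(.*?)(</q:content>)~s',
        static function (array $m): string {
            // ENT_XML1 escapes &, <, > and both quote characters for XML.
            return $m[1] . htmlspecialchars($m[2], ENT_XML1 | ENT_QUOTES, 'UTF-8') . $m[3];
        },
        $xml
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $escaped ?? $xml;
}

// Abbreviated copy of the sample response (an assumption for brevity).
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url"><q:impression><q:content><html>
<meta name="viewport" content="width=device-width, user-scalable=no">
</head>
<body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body>
</html></q:content><q:cpc>0.02</q:cpc></q:impression></q:response>';

$dom = new DOMDocument();
$dom->loadXML(escapeContentSection($xml));

// The escaped markup is decoded again when read back as text content.
$adHtml = '';
foreach ($dom->getElementsByTagNameNS('http://api-url', 'content') as $element) {
    $adHtml .= $element->textContent;
}

// Other elements of the envelope remain reachable through the namespace-aware DOM.
$cpc = $dom->getElementsByTagNameNS('http://api-url', 'cpc')->item(0)->textContent;

echo $adHtml;
```

The trade-off is that the namespace-aware parser still handles the envelope (for example <q:cpc>), while the ad markup is treated as an opaque string.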

Question

Finally, if I have to "escape a specific part of the string" (in my case, whatever sits between <q:content> and </q:content>), as that answer suggests, then why shouldn't I just look for those delimiters (<q:content></q:content>) in the first place and return what is between them? What is the benefit of using DOMDocument::loadXML() in such cases? I guess this is a pretty common case...
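The alternative raised here, looking for the delimiters directly, might be sketched as follows. The regular expression again assumes the literal prefix q and no attributes on <q:content>, and the shortened $xml is an assumption. Since the extracted fragment is HTML rather than XML, this sketch hands it to DOMDocument::loadHTML(), which tolerates the missing <head> start tag, the unclosed <meta> tags, and the raw & in the URL.

```php
<?php
// Abbreviated copy of the sample response (an assumption for brevity).
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url"><q:impression><q:content><html>
<meta name="viewport" content="width=device-width, user-scalable=no">
</head>
<body><iframe scrolling="no" src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body>
</html></q:content><q:cpc>0.02</q:cpc></q:impression></q:response>';

// Cut out whatever sits between the delimiters, without parsing the envelope.
if (preg_match('~<q:content>(.*?)</q:content>~s', $xml, $m) !== 1) {
    throw new RuntimeException('No <q:content> section found');
}
$rawHtml = trim($m[1]);

// Parse only the fragment with the lenient HTML parser, silencing its warnings.
$previous = libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($rawHtml);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Read the iframe URL from the repaired tree.
$iframe = $html->getElementsByTagName('iframe')->item(0);
$src = $iframe !== null ? $iframe->getAttribute('src') : null;

echo $src;
```

If only the raw markup is needed for output, $rawHtml is already the answer and no second parse is required; the loadHTML() step matters only when individual elements or attributes have to be inspected.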

So, my question is: given this requirement and the points listed under "Note the following points", what is the smartest way to proceed?
