Recommended way to parse an invalid XML response with namespaces in this case
I am using PHP to parse the XML response of an API. Here is a sample response -
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
<q:impression>
<q:content>
<html>
<meta name="HandheldFriendly" content="True">
<meta name="viewport" content="width=device-width, user-scalable=no">
<meta http-equiv="cleartype" content="on">
</head>
<body style="margin:0px;padding:0px;">
<iframe scrolling="no" src="http://api-response-url/with/lots?of=parameters&somethingmore=someval" width="320px" height="50px" style="border:none;"></iframe>
</body>
</html>
</q:content>
<q:cpc>0.02</q:cpc>
</q:impression>
</q:response>';
Note the following points -
The response has some invalid markup like this -
- The <head> start tag inside <html> is missing, but the tag is closed.
- The <meta> tags inside <html> are not closed.
- The iframe's src attribute contains a URL with multiple params separated by &. So, this and any other possible URLs need to be urlencoded before the $dom->loadXML(); call (see my code below).
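To see which of the problems above the XML parser actually rejects, the libxml errors can be collected instead of being emitted as PHP warnings. This is only a minimal sketch: the shortened `$xml` string is a hypothetical stand-in for the real response, and `$messages` is an illustrative variable name.

```php
<?php
// Hypothetical, shortened response that reproduces the three problems listed above.
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
<q:content>
<html>
<meta name="viewport" content="width=device-width">
</head>
<body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body>
</html>
</q:content>
</q:response>';

// Collect libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$loaded = $dom->loadXML($xml); // false when the document is not well-formed

$messages = [];
foreach (libxml_get_errors() as $error) {
    // Each LibXMLError carries the line number and a human-readable message.
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Reset libxml state so later parsing is not affected.
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($loaded);
print_r($messages);
```

The collected messages can be sent to the API provider as evidence that the response is not well-formed.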
Requirement
- I need to read whatever is inside the <q:content></q:content> tags.
- I need to parse invalid markup (as I am getting it) and properly read the content.
- URLs need to be encoded for the characters listed in What characters do I need to escape in XML documents?. This needs to be done with the current logic I am following.
Current code
So far, I have this code, which works fine if the content inside the <q:content></q:content> tags is valid markup -
$dom = new DOMDocument;
$dom->loadXML($xml); // load the XML string defined above - works only if entire xml is valid
$adHtml = "";
foreach ($dom->getElementsByTagNameNS('http://api-url', '*') as $element)
{
if($element->localName == "content")
{
$children = $element->childNodes;
foreach ($children as $child)
{
$adHtml .= $child->ownerDocument->saveXML($child);
}
}
}
echo $adHtml; //Have got necessary contents here
Check working code here (with valid markup and single param in iframe src).
What I am thinking now
Now, going with the solution given by @hakre in my previous question -
- I tried DOMDocument::loadHTML() and it fails, as I expected. It gives warnings like - Warning: DOMDocument::loadHTML(): Tag q:response invalid in Entity, line: 2
- Escape a specific part of the string for the characters listed in What characters do I need to escape in XML documents?.
Question
Finally, if I have to "escape a specific part of the string" (in my case, whatever is between <q:content> and </q:content>), as given in that answer, in order to urlencode it, then why shouldn't I look for those delimiters (<q:content></q:content>) in the first place and return what is between them? What, then, is the benefit of using DOMDocument::loadXML() in such cases? I guess this is a pretty common case...
So, my question is: given the Requirement and the points listed under Note the following points -, what is the smartest way to proceed?
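The delimiter-based approach the question describes can be sketched as follows. This is an illustration rather than a recommendation: `extractContent()` is a hypothetical helper, and it assumes the prefix is literally `q` and that `<q:content>` occurs once and is not nested. Because it matches the prefix as plain text, it is not namespace-aware, which is one thing `DOMDocument::loadXML()` would otherwise provide. The extracted fragment is then handed to the lenient HTML parser on its own, without the namespaced wrapper.

```php
<?php
// Hypothetical helper: return the raw text between <q:content> and </q:content>.
// Assumes a literal "q" prefix and a single, unnested occurrence.
function extractContent(string $response): ?string
{
    if (preg_match('~<q:content>(.*?)</q:content>~s', $response, $m) !== 1) {
        return null;
    }
    return trim($m[1]);
}

// Shortened stand-in for the API response, including the invalid markup.
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
<q:impression>
<q:content>
<html>
<meta name="HandheldFriendly" content="True">
</head>
<body style="margin:0px;padding:0px;">
<iframe scrolling="no" src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe>
</body>
</html>
</q:content>
<q:cpc>0.02</q:cpc>
</q:impression>
</q:response>';

$adHtml = extractContent($xml);
if ($adHtml === null) {
    throw new RuntimeException('No <q:content> element found');
}

// Parse only the HTML fragment with the lenient HTML parser,
// collecting its complaints internally instead of raising warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($adHtml);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$iframe = $doc->getElementsByTagName('iframe')->item(0);
$src = $iframe !== null ? $iframe->getAttribute('src') : '';

echo $adHtml, "\n";
echo $src, "\n";
```

If only the raw markup is needed for output, `$adHtml` alone is enough and the `loadHTML()` step can be skipped.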
Comments (1)
One can make many valid choices when implementing a standard. However, there are no valid choices in violating a standard. You need to present to those sending you this data some of their valid choices in implementing the XML standard.
One of those choices would be to place the HTML content within CDATA. Another would be to encode the HTML. It is simply not acceptable for them to send you garbage and call it XML. Maybe they don't realize that it's not valid XML, but it's simply not. If they don't believe you, then you should try opening the "XML" in a standard XML editor such as XMLSpy. Let them appeal to XMLSpy as a third party which can tell them whether their XML is valid.
They can then be free to choose how to produce valid XML, and you'll be required to handle their choice.
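The two choices mentioned above, CDATA and encoding, might look like this from both sides. This is a sketch under assumptions: the `$html` fragment and the `readContent()` helper are hypothetical, and the namespace URI and element names are taken from the sample response in the question.

```php
<?php
// Hypothetical HTML payload; note the unclosed <meta> and the raw "&" in the URL.
$html = '<html><head><meta name="viewport" content="width=device-width"></head>'
      . '<body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body></html>';

// Choice 1: the sender wraps the HTML in a CDATA section.
// Caveat: the payload must not itself contain the sequence "]]>".
$withCdata = '<?xml version="1.0"?>'
    . '<q:response xmlns:q="http://api-url"><q:impression>'
    . '<q:content><![CDATA[' . $html . ']]></q:content>'
    . '<q:cpc>0.02</q:cpc></q:impression></q:response>';

// Choice 2: the sender escapes the XML special characters in the HTML.
$withEncoded = '<?xml version="1.0"?>'
    . '<q:response xmlns:q="http://api-url"><q:impression>'
    . '<q:content>' . htmlspecialchars($html, ENT_QUOTES | ENT_XML1, 'UTF-8') . '</q:content>'
    . '<q:cpc>0.02</q:cpc></q:impression></q:response>';

// Receiver side: either form is well-formed XML, so namespace-aware DOM access works.
function readContent(string $xml): string
{
    $dom = new DOMDocument;
    $dom->loadXML($xml);
    $node = $dom->getElementsByTagNameNS('http://api-url', 'content')->item(0);
    // textContent returns the CDATA text, or the decoded text, as a plain string.
    return $node->textContent;
}

var_dump(readContent($withCdata) === $html);
var_dump(readContent($withEncoded) === $html);
```

In both cases the HTML arrives as text rather than as child elements, so the receiver reads it with `textContent` instead of serializing child nodes with `saveXML()`.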