Preventing DOMDocument::loadHTML() from converting entities
I have a string value that I'm trying to extract list items from. I'd like to extract the text and any subnodes; however, DOMDocument converts the entities to their characters instead of leaving them in their original state.
I've tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this has no effect. It should be noted that I'm running PHP 5.2.17 on Win7.
Example code is:
$example = '<ul><li>text</li>'.
    '<li>&frac12; of this is <strong>strong</strong></li></ul>';
echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($example);

// Walk every <li> and serialize its contents
$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;
for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));
    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;
}

// Concatenate the serialized form of each child node
function _get_inner_html( $node ) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }
    return $innerHTML;
}
The &frac12; entity ends up converted to ½ (the single UTF-8 character, not the entity), which is not the desired format.
Solution for PHP versions before 5.3.6
Based on the answer provided by ajreal, I've expanded the example variable to handle more cases, and changed _get_inner_html() to make recursive calls and handle the entity conversion for text nodes.
It's probably not the best answer, since it makes some assumptions about the elements (such as that they have no attributes). But since my particular needs don't require attributes to be carried across (yet; I'm sure my sample data will throw that one at me later on), this solution works for me.
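The approach described above might look roughly like the following. This is only a sketch, not the answerer's actual code: the function name rebuild_inner_html() is made up for illustration, and using htmlentities() to re-encode the text nodes is an assumption about how the "entity conversion" was done. As the answer says, attributes are dropped, and void elements such as <br> would come out with a closing tag.

```php
<?php
// Hypothetical helper: rebuild the inner HTML of $node by hand so that
// text nodes can be re-encoded as named entities.
function rebuild_inner_html(DOMNode $node)
{
    $innerHTML = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // DOM hands text back as UTF-8; turn it into entities again.
            $innerHTML .= htmlentities($child->nodeValue, ENT_NOQUOTES, 'UTF-8');
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            // Assumption carried over from the answer: no attributes to keep.
            $innerHTML .= '<' . $child->nodeName . '>'
                . rebuild_inner_html($child)
                . '</' . $child->nodeName . '>';
        }
        // Comments and other node types are skipped in this sketch.
    }
    return $innerHTML;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>text</li><li>&frac12; of this is <strong>strong</strong></li></ul>');
foreach ($doc->getElementsByTagName('li') as $li) {
    echo 'Saved ' . rebuild_inner_html($li) . PHP_EOL;
}
// Saved text
// Saved &frac12; of this is <strong>strong</strong>
```

Note that htmlentities() also encodes <, > and & inside text nodes, which is usually what you want when the result is stored as HTML.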
I'm a bit late and maybe it's not exactly your case, but I hate hacks and I found the cleanest way to avoid the conversions you're talking about:
There is no need to iterate over the child nodes:
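One way to do this without a loop, assuming PHP 5.3.6 or later (where DOMDocument::saveHTML() accepts a node argument), is to serialize the element itself as HTML. This is a guess at the kind of call meant here rather than the answerer's exact snippet, and outer_html() is a hypothetical name. Whether non-ASCII characters come back as named entities depends on the document encoding and the libxml build, so check the output on your own setup.

```php
<?php
// Hypothetical helper: serialize $node (including its own tag) as HTML.
// Requires PHP >= 5.3.6 for the node argument to saveHTML().
function outer_html(DOMNode $node)
{
    return $node->ownerDocument->saveHTML($node);
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>text</li><li>&frac12; of this is <strong>strong</strong></li></ul>');
$li = $doc->getElementsByTagName('li')->item(1);
echo outer_html($li) . PHP_EOL;
```

The result includes the surrounding <li> tag, so it would still need to be stripped if only the inner content is wanted.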