How to remove duplicate nested DOM elements in PHP?
Assuming you have a DOM tree with nested tags, I would like to clean up the DOM object by removing duplicates. However, this should only apply when a tag has a single child tag of the same type. For example,
Fix <div><div>1</div></div>
and not <div><div>1</div><div>2</div></div>.
I'm trying to figure out how to do this using PHP's DOM extension. Below is my starting code; I'm looking for help with the logic needed.
<?php
libxml_use_internal_errors(TRUE);
$html = '<div><div><div><p>Some text here</p></div></div></div>';
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadHTML($html);

function dom_remove_duplicate_nodes($node)
{
    var_dump($node);
    if($node->hasChildNodes())
    {
        for($i = 0; $i < $node->childNodes->length; $i++)
        {
            $child = $node->childNodes->item($i);
            dom_remove_duplicate_nodes($child);
        }
    }
    else
    {
        // Process here?
    }
}

dom_remove_duplicate_nodes($dom);
I have also collected some helper functions that might make working with DOM nodes feel more like JavaScript.
function DOM_delete_node($node)
{
    DOM_delete_children($node);
    return $node->parentNode->removeChild($node);
}

function DOM_delete_children($node)
{
    while (isset($node->firstChild))
    {
        DOM_delete_children($node->firstChild);
        $node->removeChild($node->firstChild);
    }
}

function DOM_dump_child_nodes($node)
{
    $output = '';
    $owner_document = $node->ownerDocument;
    foreach ($node->childNodes as $el)
    {
        $output .= $owner_document->saveHTML($el);
    }
    return $output;
}

function DOM_dump_node($node)
{
    if($node->ownerDocument)
    {
        return $node->ownerDocument->saveHTML($node);
    }
}
Comments (3)
You can do this quite easily with DOMDocument and DOMXPath. XPath is especially useful in your case because it lets you separate the logic that selects which elements to remove from the way you remove them.
First of all, normalize the input. I was not entirely clear about what you mean by empty whitespace. I thought it could be either empty text nodes (which might already have been removed because preserveWhiteSpace is FALSE, but I'm not sure) or text nodes whose normalized whitespace is empty. I opted for the first (if it is even necessary); the other variant would compare each text node's normalized whitespace instead. After this text node normalization you should not run into the problem you mentioned in one of the comments here.
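The answer's original normalization code did not survive on this page. As a minimal sketch of the second variant it describes (text nodes that are empty once whitespace is normalized), assuming a hypothetical input string and variable names of my own choosing, something like this could work:

```php
<?php
// Hypothetical input: whitespace-only text nodes sit between the tags.
$html = "<div>\n  <div>\n    <p>Some text here</p>\n  </div>\n</div>";

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select text nodes that are empty after whitespace normalization.
// For strictly empty text nodes only, the test would be //text()[. = ""].
$emptyTextNodes = $xpath->query('//text()[normalize-space(.) = ""]');

// The XPath result is a snapshot, so removing nodes while looping is safe.
foreach ($emptyTextNodes as $textNode) {
    $textNode->parentNode->removeChild($textNode);
}

// Count what is left; no whitespace-only text node should remain.
$remaining = $xpath->query('//text()[normalize-space(.) = ""]')->length;
echo $remaining, "\n";
```

Text nodes with real content, such as the paragraph text, are not matched by the expression and stay in place.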
The next part is to find all elements that have the same name as their parent element and are its only child. This can be expressed in XPath again. If such elements are found, all their children are moved to the parent element and then the element itself is removed.
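The selection and removal step described above can be sketched as follows. The XPath expression here is my own reading of the description, not the answer's original code, and the variable names are illustrative:

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<div><div><div><p>Some text here</p></div></div></div>');
$xpath = new DOMXPath($doc);

// Elements named like their parent that are the parent's only element child.
$redundant = $xpath->query('//*[name() = name(..) and count(../*) = 1]');

foreach ($redundant as $element) {
    $parent = $element->parentNode;
    // Move each child in front of the element so the source order is kept.
    while ($element->firstChild !== null) {
        $parent->insertBefore($element->firstChild, $element);
    }
    // The element is empty now and can be dropped.
    $parent->removeChild($element);
}

// Dump what is left inside <body>.
$body = $doc->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

For the sample from the question, the three nested div elements collapse into a single div that directly wraps the p element. Because the query result is a snapshot in document order, each inner element is unwrapped into whatever parent it has at that moment.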
Note that this rule is independent of text nodes and comments, because only element children are counted. If you don't want that, for example when actual text sits next to the nested element, the expression that counts children needs to stretch over all node types. I don't know whether that is your exact need. If it is, the count has to cover every node type; in XPath, the node() test matches all node types, whereas * matches only elements.
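To illustrate the difference between the two counting rules, here is a small sketch. The input string and both expressions are assumptions made for illustration; only the idea of counting across all node types comes from the answer:

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Hypothetical input: the outer <div> holds text next to the nested <div>.
$doc->loadHTML('<div>Intro <div>1</div></div>');
$xpath = new DOMXPath($doc);

// Lenient rule: only element siblings count, so the inner <div> matches.
$lenient = $xpath->query('//*[name() = name(..) and count(../*) = 1]');

// Strict rule: text nodes and comments count as siblings too,
// so the inner <div> is no longer the only child and does not match.
$strict = $xpath->query('//*[name() = name(..) and count(../node()) = 1]');

echo $lenient->length, ' ', $strict->length, "\n";
```

With the strict rule, the text "Intro " protects the inner div from being unwrapped, which avoids merging text from two levels into one element.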
If you did not normalize the text nodes up front (removing the empty ones), this rule is too strict. Choose the set of tools you need; I think normalizing plus this strict rule might be the best choice.

Seems like you have almost everything you need here. The place where you have // Process here? is where the removal logic belongs. Also, you're currently using recursion in dom_remove_duplicate_nodes(), which can be computationally expensive. It is possible to iterate over every node in the document without recursion using an approach like this: https://github.com/elazar/domquery/blob/master/trunk/DOMQuery.php#L73

The following is an almost-working snippet. While it does remove duplicate, nested nodes, it changes the source order because of the ->appendChild() call.
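The snippets from these two answers are missing from this page. As a minimal sketch of how the processing step could be filled in without reordering, assuming insertBefore() is used in place of appendChild() and that only a sole child node of any type counts as a duplicate wrapper, the function from the question might look like this:

```php
<?php
// Sketch: unwrap an element's only child when both share a tag name.
function dom_remove_duplicate_nodes(DOMNode $node): void
{
    if (!$node->hasChildNodes()) {
        return;
    }

    if ($node instanceof DOMElement) {
        // Repeat while the single child node is an element with the same name.
        while (
            $node->childNodes->length === 1
            && $node->firstChild instanceof DOMElement
            && $node->firstChild->nodeName === $node->nodeName
        ) {
            $duplicate = $node->firstChild;
            // insertBefore() keeps the children in their original order.
            while ($duplicate->firstChild !== null) {
                $node->insertBefore($duplicate->firstChild, $duplicate);
            }
            $node->removeChild($duplicate);
        }
    }

    // Copy the live child list first so later edits cannot skip siblings.
    foreach (iterator_to_array($node->childNodes) as $child) {
        dom_remove_duplicate_nodes($child);
    }
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<div><div><div><p>Some text here</p></div></div></div>');
dom_remove_duplicate_nodes($dom->documentElement);

// Number of <div> elements left after the cleanup.
echo $dom->getElementsByTagName('div')->length, "\n";
```

This version still recurses, as the question's code does. The recursion depth equals the depth of the tree, which is usually small for HTML documents, so whether an iterative traversal is worth the extra code depends on the input.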