DOMDocument and PHP HTML problem

Posted 2024-12-06 08:45:26


Alright. So I'm using DOMDocument to read HTML files. One thing I've noticed is that when I do this:

$doc = new DOMDocument();
$doc->loadHTML($htmlstring);
$doc->saveHTML();

it adds a doctype declaration, as well as html and body tags.

I've gotten around this by doing this:

$doc = new DOMDocument();
$doc->loadXML($htmlstring,LIBXML_NOXMLDECL);
$doc->saveXML();

The problem with this, however, is that now all my tags are case sensitive, and it gets mad if I have more than one document root.

Is there an alternative so that I can load partial HTML files, grab tags and such, replace them, and get the string back without having to parse the files manually?

Basically I want the functionality of DOMDocument->loadHTML, without the added tags and doctype.

Any ideas?


若相惜即相离 2024-12-13 08:45:26


In theory you could tell libxml not to add the implied markup. In practice, PHP's libxml bindings do not provide any means to do that. If you are on PHP 5.3.6+, pass the root node of your partial document to saveHTML(), which will then give you the outerHTML of that element, e.g.

$dom->saveHTML($dom->getElementsByTagName('body')->item(0));

would only return the <body> element with children. See

How to return outer html of DOMDocument?
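As a minimal sketch of the difference described above (the `$htmlstring` value is a made-up fragment for illustration, not from the question), compare serializing the whole document with serializing only the body node:

```php
<?php
// Hypothetical input fragment, chosen only for illustration.
$htmlstring = '<p>Hello <b>world</b></p>';

$doc = new DOMDocument();
$doc->loadHTML($htmlstring);

// No argument: the whole document is serialized, including the
// doctype and the implied <html> and <body> elements.
$full = $doc->saveHTML();

// With a node argument (PHP 5.3.6+): only that node is serialized,
// which amounts to its outerHTML.
$body  = $doc->getElementsByTagName('body')->item(0);
$outer = $doc->saveHTML($body);

echo $outer, PHP_EOL;
```

Here `$outer` still contains the `<body>` wrapper itself, so this alone does not yield the bare fragment; it only drops the doctype and the `<html>` element.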

Also note that your partial document with multiple root elements only works because loadHTML adds the implied elements. If you want a partial with multiple roots (or rather no root at all) back, you can add a fake root yourself:

$dom->loadHTML('<div id="partialroot">' . $partialDoc . '</div>');

Then process the document as needed and fetch the innerHTML of that fake root. See

How to get innerHTML of DOMNode?
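The fake-root approach could look roughly like the sketch below. The `innerHTML()` helper, the sample `$partialDoc`, and the `class` attribute edit are all assumptions added for illustration; DOMDocument has no built-in innerHTML, so the helper simply concatenates the serialized child nodes.

```php
<?php
// Hypothetical helper: concatenate the serialized children of a node.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($node) requires PHP 5.3.6+.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical partial document with two root elements.
$partialDoc = '<p>first</p><p>second</p>';

$dom = new DOMDocument();
$dom->loadHTML('<div id="partialroot">' . $partialDoc . '</div>');

// Locate the fake root by its id attribute.
$xpath = new DOMXPath($dom);
$root  = $xpath->query('//div[@id="partialroot"]')->item(0);

// Example modification: tag every paragraph inside the fake root.
foreach ($root->getElementsByTagName('p') as $p) {
    $p->setAttribute('class', 'edited');
}

// Only the children of the fake root are returned: no doctype,
// no <html>/<body>, and no wrapper <div>.
$result = innerHTML($root);
echo $result, PHP_EOL;
```

The serializer may insert line breaks between block-level elements, so the result should be compared loosely rather than byte for byte against the input.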

Also see How do you parse and process HTML/XML in PHP? for additional parsers you might want to try.


作妖 2024-12-13 08:45:26

You can wrap the content in a div with a specific id, and then extract that div object from the document object by its id.
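A rough sketch of that idea, assuming a wrapper id of `wrapper` and a made-up fragment (both are illustrative names, not from the answer). For documents loaded with loadHTML(), `getElementById()` can generally be used for the lookup:

```php
<?php
// Hypothetical fragment and wrapper id, for illustration only.
$fragment = '<span>one</span><span>two</span>';

$dom = new DOMDocument();
$dom->loadHTML('<div id="wrapper">' . $fragment . '</div>');

// Fetch the wrapper div by its id.
$div = $dom->getElementById('wrapper');

// Serialize just that div (PHP 5.3.6+); the doctype and the implied
// <html>/<body> elements are left out.
$divHtml = $dom->saveHTML($div);
echo $divHtml, PHP_EOL;
```

The output still includes the wrapper div itself, so it has to be stripped or its children serialized individually if only the original fragment is wanted.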


About the author

过潦
