DOMDocument and PHP HTML problem
Alright. So I'm using DOMDocument to read html files. One thing I've noticed is that when I do this
$doc = new DOMDocument();
$doc->loadHTML($htmlstring);
$doc->saveHTML();
it will add on a doctype header, and html and body tags.
I've gotten around this by doing this
$doc = new DOMDocument();
$doc->loadXML($htmlstring,LIBXML_NOXMLDECL);
$doc->saveXML();
The problem with this, however, is that all my tags are now case sensitive, and it gets mad if I have more than one document root.
Is there an alternative so that I can load up partial html files, grab tags and such, replace them, and get the string without having to parse the files manually?
Basically I want the functionality of DOMDocument->loadHTML, without the added tags and header.
Any ideas?
Comments (2)
In theory you could tell libxml not to add the implied markup. In practice, PHP's libxml bindings do not provide any means to do that. If you are on PHP 5.3.6+, pass the root node of your partial document to saveHTML(), which will then give you the outerHTML of that element. Passing the <body> element, for example, would return only the <body> element with its children.

Also note that your partial document with multiple root elements only works because loadHTML adds the implied elements. If you want a partial with multiple roots (or rather no root at all) back, you can add a fake root yourself, then process the document as needed, and then fetch the innerHTML of that fake root.
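The fake-root approach described above might look roughly like the following. This is only a sketch, not the answerer's original code: the helper name processFragment, the id "partialroot", and the data-seen attribute are made up for illustration, and it assumes PHP 5.3.6+ so that saveHTML() accepts a node.

```php
<?php
// Hypothetical helper: load a partial HTML string, let a callback modify it,
// and return the markup without the doctype, <html> or <body> wrappers.
function processFragment($fragment, $process)
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // Fake root: gives several top-level elements one common parent.
    $doc->loadHTML('<div id="partialroot">' . $fragment . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The fake root is the first <div> in document order.
    $root = $doc->getElementsByTagName('div')->item(0);

    // Let the caller grab and replace whatever it needs.
    call_user_func($process, $doc, $root);

    // "innerHTML" of the fake root: serialise each child node in turn.
    $html = '';
    foreach ($root->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

// Usage: two top-level elements, with mixed-case tags.
$result = processFragment('<p>One</p><P class="x">Two</P>', function ($doc, $root) {
    foreach ($root->getElementsByTagName('p') as $p) {
        $p->setAttribute('data-seen', '1');
    }
});

echo $result, "\n";
```

The returned string contains both paragraphs with lower-cased tag names and none of the implied wrapper markup. On PHP 5.4+ built against libxml 2.7.8 or later, loadHTML() also accepts the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options. The sketch avoids them because fragments with several top-level elements are commonly reported to come back nested incorrectly under LIBXML_HTML_NOIMPLIED.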
Also see "How do you parse and process HTML/XML in PHP?" for additional parsers you might want to try.
You can wrap the parts you need in divs with specific ids, and then extract just the div you want from the document object using its id.
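One way this suggestion could be read is sketched below. The ids "header" and "content" and the sample markup are assumptions for illustration, and the lookup uses DOMXPath rather than any particular method the commenter had in mind. As above, saveHTML() with a node argument needs PHP 5.3.6+.

```php
<?php
// Sample partial document: each part sits in a div with a known id.
$html = '<div id="header"><h1>Title</h1></div>'
      . '<div id="content"><p>Body text</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Look up the wanted div by its id attribute.
$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="content"]')->item(0);

// Serialise only that div (its outerHTML); empty string if it was not found.
$extracted = $node === null ? '' : $doc->saveHTML($node);

echo $extracted, "\n";
```

Only the selected div and its children are serialised, so the doctype, html and body tags added by loadHTML never appear in the output.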