How to prevent PHP's DOMDocument from encoding HTML entities?
I have a function that replaces the href attribute of anchors in a string using PHP's DOMDocument. Here's a snippet:
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($text);
$anchors = $doc->getElementsByTagName('a');
foreach ($anchors as $a) {
    $a->setAttribute('href', 'http://google.com');
}
return $doc->saveHTML();
The problem is that loadHTML($text) wraps $text in doctype, html, body, and other tags. I tried working around this by doing the following instead of calling loadHTML():
$doc = new DOMDocument('1.0', 'UTF-8');
$node = $doc->createTextNode($text);
$doc->appendChild($node);
...
Unfortunately, this encodes all the entities (anchors included). Does anyone know how to turn this off? I've already thoroughly looked through the docs and tried hacking it, but can't figure it out.
Thanks! :)
Comments (5)
If these placeholders have a strict, well-defined format, a simple preg_replace or preg_replace_callback might do the trick.
In general I don't recommend processing HTML documents with regular expressions, but for a small, well-defined subset they are suitable.
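A minimal sketch of the regex approach, for illustration only. The function name replaceHrefWithRegex is hypothetical, and the pattern assumes every href value is wrapped in double quotes; markup that deviates from that is left unchanged.

```php
<?php
// Hypothetical helper (not from the thread): swap the href value of every
// <a> tag. Assumes double-quoted href attributes.
function replaceHrefWithRegex(string $text, string $newHref): string
{
    // Group 1: everything from "<a" up to and including href="
    // Group 2: the closing quote of the attribute value
    $pattern = '/(<a\b[^>]*?\bhref=")[^"]*(")/i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($newHref): string {
            // Escape the new value so it cannot break out of the attribute.
            return $m[1] . htmlspecialchars($newHref, ENT_QUOTES) . $m[2];
        },
        $text
    );

    // preg_replace_callback returns null on a regex engine error.
    return $result ?? $text;
}
```

Because the rest of the string is never parsed, entities and surrounding text outside the matched attribute pass through untouched.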
XML has only a few predefined entities. All your HTML entities are defined somewhere else. When you use loadHTML(), these entity definitions are loaded automatically; with loadXML() (or no load call at all) they are not.
createTextNode() does exactly what the name suggests. Everything you pass as the value is treated as text content, not as markup. That is, if you pass something that has a special meaning in markup (<, >, ...), it is encoded so that a parser can distinguish the text from actual markup (&lt;, &gt;, ...).
Where does $text come from? Can't you do the replacement within the actual html document?
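The behaviour described above can be checked with a short script. This is only a sketch: the sample string and the wrapping p element are assumptions made for the demonstration.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');

// A container element for the text node (chosen arbitrarily for this demo).
$p = $doc->createElement('p');
$doc->appendChild($p);

// The string looks like markup, but createTextNode() stores it as plain text.
$p->appendChild($doc->createTextNode('<a href="#">link</a>'));

// Serialising the element shows the angle brackets escaped as entities.
$serialized = $doc->saveXML($p);

// No anchor element was created, so there is nothing to set an href on.
$anchorCount = $doc->getElementsByTagName('a')->length;

echo $serialized, "\n";
echo $anchorCount, "\n";
```

So the escaping is not an option that can be switched off; to get real anchor elements, the string has to go through a parser such as loadHTML().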
Here is a slightly less hacky solution for this issue, and it works perfectly.
I ended up hacking this in a tenuous way, changing:
into:
This cuts out all the unnecessary garbage, changing this:
into this:
Can anyone figure out something better?
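One possible alternative, sketched here and not taken from the thread: let loadHTML() build the full document, then serialise only the nodes that came from the input by passing each one to saveHTML(), which accepts a node argument since PHP 5.3.6. The function name replaceAnchorHrefs and the wrapping div are assumptions made for illustration.

```php
<?php
// Hypothetical helper: rewrite every anchor's href and return only the
// fragment, without the doctype/html/body wrapper.
function replaceAnchorHrefs(string $html, string $href): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a div so its nodes are easy to find afterwards.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('href', $href);
    }

    // The wrapper is the first div in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Serialise each child of the wrapper, skipping the wrapper itself.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo replaceAnchorHrefs('Hi <a href="http://example.com/old">link</a>!', 'http://google.com'), "\n";
```

This avoids string surgery on the saveHTML() output, at the cost of relying on the parser to keep the input inside the wrapper element.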
OK, here's the final solution I ended up with. Decided to go with VolkerK's suggestion.
This allows one to call the function with a set of different anchor attributes.
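The code for this answer did not survive in the text above. As a rough illustration of what a function taking a set of anchor attributes could look like, here is a regex-based sketch. The name setAnchorAttributes, its signature, and the decision to discard each anchor's existing attributes are all assumptions, not the poster's actual code.

```php
<?php
// Hypothetical helper: replace the attributes of every <a> opening tag
// with the given name => value pairs. Existing attributes are discarded.
function setAnchorAttributes(string $text, array $attributes): string
{
    // Build the attribute string once, escaping each value.
    $attrString = '';
    foreach ($attributes as $name => $value) {
        $attrString .= ' ' . $name . '="' . htmlspecialchars((string) $value, ENT_QUOTES) . '"';
    }

    // Match "<a" only when followed by whitespace or ">", so tags such as
    // <abbr> are not affected. A callback avoids backreference parsing
    // of "$" or "\" in the attribute values.
    $result = preg_replace_callback(
        '/<a(?=[\s>])[^>]*>/i',
        function (array $m) use ($attrString): string {
            return '<a' . $attrString . '>';
        },
        $text
    );

    return $result ?? $text;
}

echo setAnchorAttributes(
    '<p>a &amp; <a href="x" id="k">t</a></p>',
    ['href' => 'http://google.com', 'target' => '_blank']
), "\n";
```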