DOMDocument and HTML entities
I'm trying to parse some HTML that includes HTML entities, like &times;

    $str = '<a href="http://example.com/"> A &times; B</a>';
    $dom = new DOMDocument;
    $dom->substituteEntities = false;
    $dom->loadHTML($str);
    $link = $dom->getElementsByTagName('a')->item(0);
    $fullname = $link->nodeValue;
    $href = $link->getAttribute('href');
    echo "
    fullname: $fullname \n
    href: $href\n";

but DOMDocument substitutes the text with a garbled version of A × B.
Is there some way to keep it from treating the & as the start of an HTML entity and make it just leave it alone? I tried setting substituteEntities to false, but it doesn't do anything.
Comments (4)
This is not a direct answer to the question, but you could use UTF-8 instead, which lets you store glyphs like ÷ or × directly. Using UTF-8 with PHP DOM, on the other hand, needs a little hack.
Also, if you are trying to display mathematical formulas (as A × B suggests), have a look at MathML.
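The hack itself is not spelled out here. One commonly used workaround, which is an assumption and not necessarily what this answer had in mind, is to declare the charset inside the markup so that loadHTML() does not treat the bytes as ISO-8859-1. A minimal sketch:

```php
<?php
// Input that already contains a literal UTF-8 multiplication sign (bytes C3 97).
$html = '<a href="http://example.com/"> A ' . "\xC3\x97" . ' B</a>';

// Assumption: prepending a charset declaration is one way to tell libxml
// that the input is UTF-8. It is added only for parsing purposes.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument;
$dom->loadHTML($meta . $html);

$link = $dom->getElementsByTagName('a')->item(0);
$text = $link->nodeValue;   // DOM always hands strings back as UTF-8

echo $text, "\n";
```

Without the declaration, the two UTF-8 bytes of × would be read as two separate Latin-1 characters, which is the usual source of garbled text with literal non-ASCII input.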
From the docs:
The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.
Assuming you're using latin-1, try:
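The snippet that followed is missing here. As a rough sketch of the idea, assuming the page is served as ISO-8859-1, the UTF-8 string returned by the DOM can be converted before it is printed. This version uses iconv() rather than utf8_decode(), since utf8_encode() and utf8_decode() are deprecated as of PHP 8.2; the variable name $latin1 is just illustrative.

```php
<?php
$str = '<a href="http://example.com/"> A &times; B</a>';

$dom = new DOMDocument;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);

// nodeValue is UTF-8; the multiplication sign is the two bytes C3 97.
$fullname = $link->nodeValue;

// Convert to ISO-8859-1 for a Latin-1 page; there the sign is the single byte D7.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $fullname);

echo "fullname: $latin1\n";
```

This only works for characters that exist in ISO-8859-1. For anything outside that range, serving the page as UTF-8 is the simpler route.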
Are you sure the & is being substituted with &amp;? If that were the case, you'd see the exact entity, as text, not the garbled response you're getting. My guess is that it is converted to the actual character, and you're viewing the page with a latin1 charset, so the UTF-8 bytes of that character are misread, hence the garbled response.
If I render your example, my output is:
When viewing this in latin1/iso-8859-1, I see the output you're describing. But when I set the charset to UTF-8, the output is fine.
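A small check along these lines can confirm the guess. This is a sketch, and the variable names $hasTimes and $hasEntity are only illustrative. It declares the charset before any output and then inspects the bytes that the DOM returns.

```php
<?php
// Declare the charset before any output; strings coming out of the DOM are UTF-8.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$dom = new DOMDocument;
$dom->loadHTML('<a href="http://example.com/"> A &times; B</a>');
$fullname = $dom->getElementsByTagName('a')->item(0)->nodeValue;

// If the entity had been escaped, an ampersand would still be present.
$hasEntity = strpos($fullname, '&') !== false;

// If it was decoded, the text holds U+00D7, encoded in UTF-8 as bytes C3 97.
$hasTimes = strpos($fullname, "\xC3\x97") !== false;

var_dump($hasEntity, $hasTimes);
```

If the second value is true and the first is false, the entity was decoded to the real character, and any garbling comes from how the output is being displayed.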
I fixed my problem with broken entities by converting UTF-8 to UTF-8 with BOM.