DOMDocument and HTML entities

Posted on 2024-12-02 07:21:50

I'm trying to parse some HTML that includes HTML entities such as &#215;:

$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";

but DOMDocument turns the text into A Ã— B.

Is there some way to keep it from treating the & as the start of an HTML entity and just leave it alone? I tried setting substituteEntities to false, but it doesn't do anything.

梦屿孤独相伴 2024-12-09 07:21:50

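This is not a direct answer to the question, but you could use UTF-8 instead, which lets you store glyphs such as ÷ or × directly. Using UTF-8 with PHP DOM does take a small workaround, though.

Also, if you are trying to display mathematical formulas (as A × B suggests), have a look at MathML.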
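
The answer does not say which workaround it has in mind. One commonly used option is to declare the charset inside the markup handed to loadHTML(), so the parser knows the bytes are UTF-8. The sketch below assumes that approach; the wrapper markup and the helper name load_utf8_fragment are illustrative and not part of the original answer.

```php
<?php
// Hypothetical helper: wrap a UTF-8 fragment in a document that declares its
// charset, so the HTML parser behind DOMDocument decodes the bytes as UTF-8.
function load_utf8_fragment(string $fragment): DOMDocument
{
    $html = '<html><head>'
          . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
          . '</head><body>' . $fragment . '</body></html>';

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return $dom;
}

// "\u{D7}" is the multiplication sign, written as an escape so the example
// does not depend on the encoding of this source file.
$dom  = load_utf8_fragment("<a href=\"http://example.com/\"> A \u{D7} B</a>");
$link = $dom->getElementsByTagName('a')->item(0);

$fullname = trim($link->nodeValue); // text comes back as UTF-8
echo $fullname, "\n";
echo $link->getAttribute('href'), "\n";
```

The page that prints $fullname would then need to be served as UTF-8 as well, otherwise the browser misreads the bytes again.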

迷爱 2024-12-09 07:21:50

From the docs:

The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.

Assuming you're using latin-1, try:

<?php
header('Content-Type: text/html; charset=iso-8859-1');

// Assumes this source file is saved as ISO-8859-1, so the literal × is one latin-1 byte.
// Note: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
$str = utf8_encode('<a href="http://example.com/"> A × B</a>');

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// The DOM returns UTF-8; convert it back to latin-1 to match the header above.
$fullname = utf8_decode($link->nodeValue);
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";
?>
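
The quoted documentation mentions Iconv for encodings other than ISO-8859-1. As a minimal sketch of that conversion step, assuming the input arrives as Windows-1252 bytes (the helper name to_utf8 and the sample string are made up for illustration):

```php
<?php
// Hypothetical helper: convert bytes in a known legacy encoding to UTF-8,
// which is what the DOM extension expects.
function to_utf8(string $bytes, string $from): string
{
    $out = iconv($from, 'UTF-8', $bytes);
    if ($out === false) {
        throw new RuntimeException("Cannot convert from $from to UTF-8");
    }
    return $out;
}

// "\xD7" is the single byte for the multiplication sign in Windows-1252
// (and in ISO-8859-1).
$legacy = "A \xD7 B";
$utf8   = to_utf8($legacy, 'CP1252');

echo bin2hex($legacy), "\n"; // one byte for the sign
echo bin2hex($utf8), "\n";   // two bytes for the sign

// Going the other way, e.g. for text read back from a node's nodeValue:
$back = iconv('UTF-8', 'CP1252', $utf8);
```

Naming the source encoding explicitly keeps the conversion from depending on what a given server happens to treat as its default.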

止于盛夏 2024-12-09 07:21:50

Are you sure the & is being substituted to &amp;? If that were the case, you'd see the exact entity, as text, not the garbled response you're getting.

My guess is that it is converted to the actual character, and you're viewing the page with a latin1 charset, which misreads the character's two UTF-8 bytes, hence the garbled response.

If I render your example, my output is:

fullname:  A × B 

href: http://example.com/

When viewing this in latin1/iso-8859-1, I see the output you're describing. But when I set the charset to UTF-8, the output is fine.
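
The explanation above can be checked by looking at bytes instead of rendered text. The following is a sketch only; it assumes the iconv extension is available, and the final htmlentities() step is an optional extra for cases where an entity is wanted in the output, not something this answer proposes.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<a href="http://example.com/"> A &#215; B</a>');
$text = trim($dom->getElementsByTagName('a')->item(0)->nodeValue);

// The entity has become the real character, stored as two UTF-8 bytes (c3 97).
echo bin2hex($text), "\n";

// Decoding those same two bytes as Windows-1252 (what browsers typically use
// for a latin1 label) yields two unrelated characters instead of one.
$misread = iconv('CP1252', 'UTF-8', "\xC3\x97");
echo $misread, "\n";

// If an entity is wanted in the output, the character can be re-encoded
// on the way out.
$asEntity = htmlentities($text, ENT_QUOTES, 'UTF-8');
echo $asEntity, "\n";
```

Sending header('Content-Type: text/html; charset=utf-8') before any output is the usual way to make the browser decode the page as UTF-8.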

野稚 2024-12-09 07:21:50

I solved the problem of corrupted entities by converting the file from UTF-8 to UTF-8 with a BOM.
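
The answer gives no code. Assuming it means the markup handed to the parser was re-saved with a byte order mark, the same effect can be sketched in code by prepending the three BOM bytes. The helper name with_utf8_bom is hypothetical, and whether the parser honors the BOM is the answerer's report, not something this sketch guarantees.

```php
<?php
// Hypothetical helper: prepend the UTF-8 byte order mark unless it is already there.
function with_utf8_bom(string $utf8): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($utf8, $bom, 3) === 0 ? $utf8 : $bom . $utf8;
}

// "\u{D7}" is the multiplication sign as UTF-8, independent of this file's encoding.
$html = with_utf8_bom("<a href=\"http://example.com/\"> A \u{D7} B</a>");

$dom = new DOMDocument();
$dom->loadHTML($html);
$link = $dom->getElementsByTagName('a')->item(0);
$text = $link !== null ? $link->nodeValue : '';

// Inspect the bytes rather than trusting how a terminal or browser renders them.
echo bin2hex($text), "\n";
```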

About the author

哥，最终变帅啦
