DOMDocument and HTML entities

Posted on 2024-12-02 07:21:50

I'm trying to parse some HTML that includes HTML entities, such as &#215;:

$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";

But DOMDocument replaces the entity, so the text comes out as A × B instead of A &#215; B.

Is there some way to keep it from treating the & as the start of an HTML entity and make it leave the text alone? I tried setting substituteEntities to false, but that has no effect.

Comments (4)

梦屿孤独相伴 2024-12-09 07:21:50

This is not a direct answer to the question, but you could use UTF-8 instead, which lets you store glyphs like ÷ or × directly. Using UTF-8 with PHP DOM, on the other hand, needs a little hack.

Also, if you are trying to display mathematical formulas (as A × B suggests), have a look at MathML.
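
The answer does not say which hack it means. One widely used workaround, sketched here as an assumption rather than as the answerer's method, is to put a charset declaration in front of the markup so that libxml reads the bytes as UTF-8 instead of guessing. The variable names and the prepended meta tag are illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal sign below is U+00D7.
$html = '<a href="http://example.com/"> A × B</a>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Prepending a charset declaration tells the HTML parser that the input is UTF-8.
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;            // UTF-8 string containing the × character
$href = $link->getAttribute('href');

echo "fullname: $fullname\nhref: $href\n";
```

The result is still UTF-8, so the page that prints it has to be served with a UTF-8 charset as well.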

迷爱 2024-12-09 07:21:50

From the docs:

The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.

Assuming you're using Latin-1, try:

<?php
header('Content-type: text/html; charset=iso-8859-1');

// Convert the Latin-1 input to UTF-8 before handing it to DOM.
$str = utf8_encode('<a href="http://example.com/"> A &#215; B</a>');

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// DOM returns UTF-8, so convert back to Latin-1 for output.
$fullname = utf8_decode($link->nodeValue);
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";
?>
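
utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, so the same idea may be easier to maintain with iconv(), which the quoted documentation also mentions. The following is a minimal sketch under the same Latin-1 assumption. It skips the input conversion because this particular input is pure ASCII, and the header() call is left as a comment so the snippet also runs on the command line.

```php
<?php
// In a web context, declare the output charset first:
// header('Content-Type: text/html; charset=iso-8859-1');

// Pure ASCII input: the multiplication sign is written as a numeric entity.
$str = '<a href="http://example.com/"> A &#215; B</a>';

libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// DOM hands back UTF-8; convert it to ISO-8859-1 for a Latin-1 page.
$fullname = iconv('UTF-8', 'ISO-8859-1', $link->nodeValue);
$href = $link->getAttribute('href');

echo "fullname: $fullname\nhref: $href\n";
```

After the conversion, the × is the single byte 0xD7, which is what a browser reading the page as ISO-8859-1 expects.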

止于盛夏 2024-12-09 07:21:50

Are you sure the & is being replaced with &amp;? If that were the case, you would see the exact entity as text, not the garbled response you're getting.

My guess is that the entity is converted to the actual character, which DOM returns as UTF-8, and you are viewing the page as Latin-1, so the bytes of that character are misread, hence the garbled response.

If I render your example, my output is:

fullname:  A × B 

href: http://example.com/

When I view this as latin1/ISO-8859-1, I see the output you're describing. When I set the charset to UTF-8, the output is fine.
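
To check this explanation, it can help to look at the raw bytes DOM returns. The sketch below is an illustration, not part of the original answer. It prints the value as hex and then shows one way to get an ASCII-only string back by re-encoding the character as an entity with htmlentities(). That call produces the named form &times;. If the numeric form &#215; is required, mb_encode_numericentity() from the mbstring extension is an alternative.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<a href="http://example.com/"> A &#215; B</a>');
$fullname = $dom->getElementsByTagName('a')->item(0)->nodeValue;

// The entity has been decoded to U+00D7, which is the byte pair C3 97 in UTF-8.
echo bin2hex($fullname), "\n";

// Option 1 (web context): serve the page as UTF-8.
// header('Content-Type: text/html; charset=utf-8');

// Option 2: turn the character back into an entity so the output is plain ASCII.
$entities = htmlentities($fullname, ENT_QUOTES, 'UTF-8');
echo $entities, "\n";
```

Either option avoids the mismatch between the bytes DOM produces and the charset the browser assumes.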

野稚 2024-12-09 07:21:50

I fixed my problem with broken entities by converting UTF-8 to UTF-8 with BOM.
