PHP DOMDocument: get text content, ignoring script tags and comments

Posted on 2024-12-02 01:20:35


I use DOMDocument to load HTML from a database like this:

$doc = new DOMDocument();
@$doc->loadHTML($data);
$doc->encoding = 'utf-8';
$doc->saveHTML();

Then I get the body text like this:

$bodyNodes = $doc->getElementsByTagName("body");
$words = htmlspecialchars($bodyNodes->item(0)->textContent);

The text I get back includes everything inside <body>, so the contents of <script> tags are included as well.
How do I remove those and keep only the real text content?


如果没结果 2024-12-09 01:20:35

You can use XPath for this.

Borrowing the HTML arnaud used for his example above:

$html = <<< HTML
<p>
    test<span>foo<b>bar</b>
</p>
<script>
    ignored
</script>
<!-- comment is ignored -->
<p>test</p>
HTML;

You simply query all text nodes that are not inside a script tag and that do not normalize to an empty string. You also set preserveWhiteSpace to false so that whitespace used only for formatting is not considered.

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);

$xp    = new DOMXPath($dom);
$nodes = $xp->query('/html/body//text()[
    not(ancestor::script) and
    not(normalize-space(.) = "")
]');

foreach($nodes as $node) {
    var_dump($node->textContent);
}

This will output:

string(10) "
    test"
string(3) "foo"
string(3) "bar"
string(4) "test"
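
The question asks for a single string rather than a list of nodes, so here is a minimal sketch of how the XPath result above could be joined into one. The helper name `bodyText` and the sample HTML are assumptions made for illustration, not part of the answer. Joining the pieces with a single space is a design choice: it keeps words from adjacent block elements apart, but it also inserts a space between inline fragments such as `foo<b>bar</b>`.

```php
<?php
// Hypothetical helper (not from the answer above): returns the visible
// body text of an HTML string as one space-separated string.
function bodyText(string $html): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings about sloppy markup instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    // Same predicate as the answer: skip text inside <script> and
    // text nodes that contain only whitespace.
    $nodes = $xp->query(
        '/html/body//text()[not(ancestor::script) and not(normalize-space(.) = "")]'
    );

    $parts = [];
    foreach ($nodes as $node) {
        // Collapse runs of whitespace inside each piece, then trim it.
        $parts[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
    }
    return implode(' ', $parts);
}

// Sample input (assumed): one script block and one comment to be ignored.
$html = '<html><body><p>Hello <b>world</b></p>'
      . '<script>var x = 1;</script><!-- note --><p>again</p></body></html>';

echo bodyText($html), "\n"; // prints "Hello world again"
```

The result can then be passed through `htmlspecialchars()` exactly as in the question.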


蒗幽 2024-12-09 01:20:35

You have to visit every node and return its text. If a node contains other nodes, visit those too.

This can be done with this basic recursive algorithm:

extractNode:
    if node is a text node or a cdata node, return its text
    if it is an element node or a document node or a document fragment node:
        if it’s a script node, return an empty string
        return a concatenation of the result of calling extractNode on all the child nodes
    for everything else return nothing

Implementation:

function extractText($node) {
    // Text and CDATA nodes contribute their own text.
    if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
        return $node->nodeValue;
    }
    // Containers: concatenate the text of all children, skipping <script>.
    if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
        if ('script' === $node->nodeName) {
            return '';
        }

        $text = '';
        foreach ($node->childNodes as $childNode) {
            $text .= extractText($childNode);
        }
        return $text;
    }
    // Comments and any other node types contribute nothing.
    return '';
}

This will return the textContent of the given $node, ignoring script tags and comments.

$words = htmlspecialchars(extractText($bodyNodes->item(0)));

Try it here: http://codepad.org/CS3nMp7U
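
To see the recursive walk in action without the external link, here is a self-contained sketch that restates it compactly and runs it on a small document. The function name `visibleText`, the `$skip` parameter, and the sample HTML are assumptions for illustration. Skipping `style` as well as `script`, and handling `XML_HTML_DOCUMENT_NODE`, are additions that are not in the answer above.

```php
<?php
// Hypothetical variant of the recursive walk. Names and sample HTML are
// illustrative; the skip list is an addition to the original answer.
function visibleText(DOMNode $node, array $skip = ['script', 'style']): string
{
    switch ($node->nodeType) {
        case XML_TEXT_NODE:
        case XML_CDATA_SECTION_NODE:
            // Leaf text: return it as-is.
            return (string) $node->nodeValue;

        case XML_ELEMENT_NODE:
        case XML_DOCUMENT_NODE:
        case XML_HTML_DOCUMENT_NODE:
        case XML_DOCUMENT_FRAG_NODE:
            // Skipped elements contribute nothing, including their children.
            if (in_array(strtolower($node->nodeName), $skip, true)) {
                return '';
            }
            $text = '';
            foreach ($node->childNodes as $child) {
                $text .= visibleText($child, $skip);
            }
            return $text;

        default:
            // Comments, processing instructions, etc. contribute nothing.
            return '';
    }
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of using @
$doc->loadHTML(
    '<html><body><p>one<b>two</b></p>'
    . '<script>alert(1);</script><!-- c --><p>three</p></body></html>'
);
libxml_clear_errors();

$body = $doc->getElementsByTagName('body')->item(0);
echo htmlspecialchars(visibleText($body)), "\n"; // prints "onetwothree"
```

Unlike the XPath approach, this walk concatenates text exactly as it appears in the document, so no separator is inserted between adjacent elements.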


About the author

剩一世无双
