How to extract all text from an HTML file using PHP?

Posted 2024-08-07 01:56:08, 614 characters, 9 views, 0 comments

How can I extract all the text from an HTML file?

I want to extract all the text: the contents of alt attributes, <p> tags, and so on.

However, I don't want to extract the text between style and script tags.

Right now I have the following code:

    <?php
    // Strip the tags, lowercase, and collapse every run of non-letters into one space.
    $string = trim(clean(strtolower(strip_tags($html_content))));
    $arr = explode(" ", $string);
    // Count how often each word occurs.
    $count = array_count_values($arr);
    foreach ($count as $value => $freq) {
        echo trim($value) . "---" . $freq . "<br>";
    }

    function clean($in) {
        return preg_replace("/[^a-z]+/i", " ", $in);
    }

    ?>

This works well, but it also retrieves the contents of script and style tags, which I don't want.
The other problem is that I'm not sure it retrieves attributes like alt, since the strip_tags function might remove all HTML tags together with their attributes.

Thanks
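
A quick check suggests both concerns are justified. This is only a sketch: the sample markup is made up for illustration, and `$plain` is a hypothetical variable name.

```php
<?php
// Hypothetical sample markup, made up for this illustration.
$html_content = '<p>Hello</p><img src="a.png" alt="A cat"><script>var x = 1;</script>';

// strip_tags() deletes the tags themselves (attributes included)
// but leaves whatever text sits between them.
$plain = strip_tags($html_content);

echo $plain, PHP_EOL;
```

The script body survives as plain text, while the alt text disappears along with the img tag.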

Comments (5)

聆听风音 2024-08-14 01:56:08

I personally think you should switch to an XML reader of some sort (SimpleXML, Document Object Model or XMLReader) to parse the HTML document. I'd go for a mix of DOM, SimpleXML and XPath to extract what you need; everything else will fail miserably when parsing arbitrary documents:

    $dom = new DOMDocument();
    $dom->loadHTML($html_content); // use DOMDocument because it can load HTML
    $xml = simplexml_import_dom($dom); // switch to SimpleXML because it's easier to use
    $pTags = $xml->xpath('/html/body//p');
    $tagsWithAltAttribute = $xml->xpath('/html/body//*[@alt]');
    // ...
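
A fuller sketch of the same idea, staying with DOMDocument and DOMXPath rather than switching to SimpleXML. The function name `extractTextAndAlt` and the sample markup are hypothetical, and the XPath expressions are one possible way to skip script and style content, not the only one.

```php
<?php
// Hypothetical helper, not part of the answer above.
function extractTextAndAlt(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $parts = [];

    // Text nodes anywhere in the document, except inside <script> or <style>.
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $parts[] = $node->nodeValue;
    }
    // Values of every alt attribute.
    foreach ($xpath->query('//@alt') as $attr) {
        $parts[] = $attr->nodeValue;
    }

    // Trim each fragment and drop the ones that were whitespace only.
    $parts = array_map('trim', $parts);
    return array_values(array_filter($parts, 'strlen'));
}

// Made-up sample document for illustration.
$sample = '<html><head><style>p { color: red; }</style></head>'
        . '<body><p>Hello <b>world</b></p><img src="a.png" alt="A cat">'
        . '<script>var x = 1;</script></body></html>';

$fragments = extractTextAndAlt($sample);
echo implode(' ', $fragments), PHP_EOL;
```

The joined result can be fed into the word-counting code from the question in place of the strip_tags output. Note that loadHTML does not assume UTF-8 unless the document declares its charset, so non-ASCII pages may need extra care.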

旧人九事 2024-08-14 01:56:08

First remove the script and style tags together with their full content, then clean the remaining tags the way you already do, and you'll get the text.

迷荒 2024-08-14 01:56:08

First you can search for the <script> and <style> blocks and remove them from the HTML.

I have this function that I use a lot:

    function search($start, $end, $string, $borders = true) {
        $reg = "!" . preg_quote($start) . "(.*?)" . preg_quote($end) . "!is";
        preg_match_all($reg, $string, $matches);

        // $matches[0]: blocks including the delimiters; $matches[1]: inner content only
        if ($borders) return $matches[0];
        else return $matches[1];
    }

The function returns the matching blocks in an array.

    $array = search("<script>", "</script>", $html);

Once the scripts and styles are gone, use strip_tags to get the text.
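
Since search() only returns the blocks, they still have to be cut out of the HTML before strip_tags runs. A minimal sketch of that step, assuming str_replace is an acceptable way to do the removal; the sample markup is made up:

```php
<?php
// The search() helper from this answer, with the same behaviour.
function search($start, $end, $string, $borders = true)
{
    $reg = "!" . preg_quote($start) . "(.*?)" . preg_quote($end) . "!is";
    preg_match_all($reg, $string, $matches);
    return $borders ? $matches[0] : $matches[1];
}

// Hypothetical sample markup.
$html = "<p>Hello</p><script>var x = 1;</script><style>p { color: red; }</style><p>world</p>";

// Collect the complete blocks (delimiters included) and cut them out.
$blocks = array_merge(
    search("<script>", "</script>", $html),
    search("<style>", "</style>", $html)
);
$html = str_replace($blocks, '', $html);

// Only ordinary markup is left, so strip_tags now yields just the text.
$text = strip_tags($html);
echo $text, PHP_EOL;
```

One limitation to keep in mind: because the delimiters are matched literally, an opening tag with attributes, such as `<script type="text/javascript">`, is not found by `search("<script>", ...)`.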

a√萤火虫的光℡ 2024-08-14 01:56:08

Any kind of parsing is not an option as long as you can't be sure the source is 100% well-formed XML (which HTML4, by definition, is not).

A simple preg_replace should suffice. Something like

    preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);

should be enough to replace all the script and style elements and their contents with an empty string (i.e. strip them).
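
A sketch of this approach wrapped in a helper (the name `stripScriptAndStyle` and the sample markup are hypothetical). The pattern here matches lazily with `.*?`, uses the `s` modifier so that `.` spans newlines, and allows attributes on the opening tag; a greedy, single-line pattern can swallow everything between the first opening tag and the last closing tag, or miss scripts that span several lines.

```php
<?php
// Hypothetical helper name, introduced for this illustration.
function stripScriptAndStyle(string $html): string
{
    // \b[^>]*  allows attributes on the opening tag
    // .*?      is lazy, so matching stops at the first closing tag
    // s flag   lets . match newlines; i flag ignores case
    return preg_replace('/<(script|style)\b[^>]*>.*?<\/\1\s*>/is', '', $html);
}

// Made-up sample with a multi-line, upper-case script element.
$html = "<p>one</p>\n<SCRIPT type=\"text/javascript\">\nvar a = 1;\n</SCRIPT>\n"
      . "<p>two</p>\n<style>p { color: red; }</style>\n<p>three</p>";

$text = strip_tags(stripScriptAndStyle($html));
echo $text, PHP_EOL;
```

The paragraph between the two removed elements ("two") is kept, which is the case a greedy pattern tends to get wrong.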

If you want to avoid XSS attacks, however, you're probably better off using an HTML sanitiser to normalise the HTML and then strip all the bad code.

陌路黄昏 2024-08-14 01:56:08

I posted this as an answer to another post, but here it is again:

We've just launched a new natural language processing API over at repustate.com. Using a REST API (so just using curl will be fine), you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. Check it out and compare the results to readability.js; I think you'll find they're almost 100% the same.
