How do I extract all text from an HTML file using PHP?
How do I extract all the text from an HTML file?
I want to extract all text, including the text in alt attributes, <p> tags, etc.
However, I don't want to extract the text between style and script tags.
Thanks.
Right now I have the following code:
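<?php
// Strip tags, lowercase, and replace every run of non-letters with a single space.
$string = trim(clean(strtolower(strip_tags($html_content))));
// Split into words and count how often each one occurs.
$arr = explode(" ", $string);
$count = array_count_values($arr);
foreach ($count as $value => $freq) {
    echo trim($value) . "---" . $freq . "<br>";
}

function clean($in) {
    return preg_replace("/[^a-z]+/i", " ", $in);
}
?>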
This works well, but it also retrieves the contents of script and style tags, which I don't want.
The other problem is that I'm not sure whether it retrieves attributes like alt, since the strip_tags function might remove all HTML tags along with their attributes.
Thanks
Comments (5)
I personally think you should switch to an XML reader of some sort (SimpleXML, the Document Object Model, or XMLReader) to parse the HTML document. I'd go for a mix of DOM, SimpleXML and XPath to extract what you need; everything else will fail miserably when parsing arbitrary documents.
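The code that accompanied this suggestion was not preserved, so here is a minimal sketch of the DOM plus XPath approach rather than the answerer's own implementation. The function name extractText() and the sample markup are hypothetical, and the sketch assumes the ext-dom extension is available and that the input is ASCII or declares its own charset.

```php
<?php
// Sketch only: extractText() is a hypothetical helper, not code from the original answer.
function extractText(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them for malformed markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes that are not inside <script> or <style>, plus every alt attribute.
    $nodes = $xpath->query(
        '//text()[not(ancestor::script) and not(ancestor::style)] | //@alt'
    );

    $parts = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

// Hypothetical sample document.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello <b>world</b></p><img src="a.png" alt="A cat">'
      . '<script>var x = 1;</script></body></html>';

echo extractText($html), "\n";
```

The returned string can then go through the same lowercase, clean, and array_count_values steps as in the question.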
First remove the script and style elements together with their full contents, then strip the remaining tags the way you already do, and you'll get the text.
First, you can search for the <script> and <style> blocks and remove them from the HTML.
I have a function I use a lot for this; it returns the matching blocks in an array.
Once the scripts and styles are gone, use strip_tags to get the text.
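The helper function this answer refers to did not survive on this page. As a rough stand-in, something along these lines would return the matching blocks in an array; findBlocks() and the sample markup are hypothetical and are not the answerer's code.

```php
<?php
// Hypothetical stand-in for the helper mentioned above (the original was not preserved).
// Returns every <tag>...</tag> block found in $html, including the tags themselves.
function findBlocks(string $html, string $tag): array
{
    $name = preg_quote($tag, '#');
    $pattern = '#<' . $name . '\b[^>]*>.*?</' . $name . '\s*>#is';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

// Hypothetical sample input.
$html = '<p>Keep me</p><script type="text/javascript">alert("x");</script>'
      . '<style>p { margin: 0; }</style><p>And me</p>';

// Replace each matched block with a space, then let strip_tags() drop the remaining markup.
foreach (['script', 'style'] as $tag) {
    $html = str_replace(findBlocks($html, $tag), ' ', $html);
}

// Collapse the leftover whitespace into single spaces.
$text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

echo $text, "\n";
```

Note that strip_tags discards attributes such as alt, so this route only recovers the text between tags.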
Any kind of parsing is not an option as long as you can't be sure the source is 100% well-formed XML (which HTML4, by definition, is not).
A simple preg_replace should suffice. A pattern that matches each script and style element together with its contents, replaced with an empty string, should be enough to strip them.
If you want to avoid XSS attacks, however, you're probably better off using an HTML sanitiser to normalise the HTML and then strip all the bad code.
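The exact expression from this answer was not preserved, so the pattern below is only one plausible shape for the preg_replace call, with a hypothetical sample string. It is a sketch for stripping text, not a defence against hostile markup.

```php
<?php
// Illustrative pattern only; the expression from the original answer was not preserved.
// Hypothetical sample input with mixed-case tags and a multi-line style block.
$html = '<p>Visible</p><SCRIPT>var hidden = 1;</SCRIPT>'
      . "<style type=\"text/css\">\nbody { color: #000; }\n</style><p>text</p>";

// \1 makes the closing tag match the opening one.
// The "s" flag lets "." span newlines; the "i" flag ignores case.
$stripped = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $html);

echo $stripped, "\n";
```

The result still contains the other tags, so it would be passed on to strip_tags as in the question.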
I posted this as an answer to another post, but here it is again:
We've just launched a new natural language processing API over at repustate.com. Using a REST API (so just using curl will be fine), you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.