How do I extract heading tags from a PHP string?
From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable?
I would like to capture all of the text from these elements and store it in a new variable as comma-delimited values.
Is it possible using preg_match_all()?
Comments (8)
First you need to clean up the HTML ($html_str in the example) with tidy:
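A minimal sketch of the cleanup step, assuming the tidy extension is installed. The configuration options and the clean_html() helper name are assumptions for illustration, not the answerer's code.

```php
<?php
// Hypothetical helper: repairs markup with the tidy extension, if it is available.
function clean_html(string $html_str): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not loaded
    }
    $config = [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // avoid named entities that XML parsers reject
    ];
    $xml_str = tidy_repair_string($html_str, $config, 'utf8');
    return $xml_str === false ? null : $xml_str;
}

// The unclosed <p> is the kind of problem tidy is expected to repair.
$xml_str = clean_html('<h1>Title</h1><p>Unclosed paragraph');
```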
Then you can load the XML ($xml_str) into a DOMDocument:
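The loading step might be sketched as follows. The load_xml() helper name and the sample markup are assumptions; the point is that loadXML() only succeeds on well-formed input, which is why the cleanup comes first.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or null if the XML is not well-formed.
function load_xml(string $xml_str): ?DOMDocument
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $ok = $dom->loadXML($xml_str);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok ? $dom : null;
}

$dom = load_xml('<html><body><h1>Title</h1></body></html>');
```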
And finally you can use Horia Dragomir's method:
Alternatively, you could use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php).
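As a rough illustration of the XPath route, the sketch below collects all six heading levels in document order and joins them with commas. The sample markup is an assumption. Note that if the document declares the XHTML namespace, the element names would need a registered prefix for the query to match.

```php
<?php
// Sample input (assumed); well-formed XML without a namespace declaration.
$xml_str = '<html><body><h1>A</h1><h2>B</h2><p>x</p><h3>C</h3></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml_str);

$xpath = new DOMXPath($dom);
// A union of paths; the resulting node list is in document order.
$nodes = $xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6');

$headings = [];
foreach ($nodes as $node) {
    $headings[] = trim($node->textContent);
}
$csv = implode(',', $headings);
```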
You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:
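For such a simple scenario, a preg_match_all() call along these lines could work. The pattern and sample markup are assumptions, and this will not cope with nested or malformed headings.

```php
<?php
// Sample input (assumed).
$html = '<h1>Intro</h1><p>text</p><h2 class="sub">Details</h2>';

// Match <h1>..<h6> with optional attributes; capture the inner content lazily.
// i = case-insensitive, s = let "." match newlines.
preg_match_all('/<h[1-6][^>]*>(.*?)<\/h[1-6]>/is', $html, $matches);

// Remove any inline markup inside the headings, then join with commas.
$csv = implode(',', array_map('strip_tags', $matches[1]));
```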
I know this is a very old post, but I wanted to mention the best way I have found to grab all heading tags at once.
This method works as a plain regex, although PHP behaves a bit differently.
Use this in your preg_match:
$group[1] will contain whatever is between the heading tags. $group[0] is the entire match, e.g. <h1>test</h1>.
This accounts for spaces, and if someone adds a class or id attribute, that part (a group) is ignored.
NOTE: When I analyze HTML tags, I always replace all whitespace (line breaks, tabs, etc.) with a single space. This avoids having to deal with multi-line and dotall matching, and with very large runs of whitespace, which in some cases can break the regex.
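A sketch of the whitespace normalization followed by a preg_match() call, to show what ends up in $group[0] and $group[1]. The pattern here is an assumption written to match the description above, not necessarily the one the answer used.

```php
<?php
// Sample input (assumed): a heading spread over several lines with a class attribute.
$html = "<h1 class=\"title\">\n\tHello   world\n</h1>";

// Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
$flat = preg_replace('/\s+/', ' ', $html);

// Attributes are skipped by a non-capturing group, so only the content is captured.
$pattern = '/<h[1-6](?:\s[^>]*)?>\s*(.*?)\s*<\/h[1-6]>/i';

if (preg_match($pattern, $flat, $group)) {
    $whole = $group[0]; // the entire match, tags included
    $inner = $group[1]; // only the text between the tags
}
```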
Here is a link to the test page regex test
If you actually want to use regular expressions, I think that:
should work as long as your header tags are not nested. As others have said, if you're not in control of the HTML, regular expressions are not a great way to do this.
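One way to write such an expression is with a backreference, so that the closing tag must have the same level as the opening tag. This is a sketch under that assumption, with made-up sample markup.

```php
<?php
// Sample input (assumed).
$html = '<h1>One</h1><div><h2 id="x">Two</h2></div><h3>Three</h3>';

// Group 1 captures the heading level; \1 forces </hN> to use the same level.
// Group 2 captures the heading content.
preg_match_all('/<h([1-6])\b[^>]*>(.*?)<\/h\1>/is', $html, $matches);

$levels = $matches[1];               // heading levels as strings
$csv    = implode(',', $matches[2]); // comma-delimited heading text
```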
It is recommended not to use regex for this job, and to use something like the SimpleHTMLDOM parser instead.
Please also consider the native DOMDocument PHP class. You can use $domdoc->getElementsByTagName('h1') to get your headings.
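A short sketch of the DOMDocument approach, extended to all six heading levels. The sample markup is an assumption. Because each level is looked up separately, the results come out grouped by level rather than in document order.

```php
<?php
// Sample input (assumed); loadHTML() tolerates fragments and sloppy markup.
$html = '<h1>Main</h1><p>Body</p><h2>Section</h2><h1>Other</h1>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$domdoc = new DOMDocument();
$domdoc->loadHTML($html);
libxml_clear_errors();

$headings = [];
foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
    foreach ($domdoc->getElementsByTagName($tag) as $node) {
        $headings[] = trim($node->textContent);
    }
}
$csv = implode(',', $headings);
```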
I just want to share my solution:
This function returns an empty array if no headings are found; otherwise it returns something like this:
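A function with that behaviour could look roughly like the sketch below. The get_headings() name and the shape of the returned array are assumptions for illustration. It walks every element so the headings keep their document order.

```php
<?php
// Hypothetical helper: returns a list of ['tag' => ..., 'text' => ...] entries,
// or an empty array when the markup contains no headings.
function get_headings(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    // '*' selects every element, in document order.
    foreach ($dom->getElementsByTagName('*') as $el) {
        $tag = strtolower($el->nodeName);
        if (preg_match('/^h[1-6]$/', $tag)) {
            $found[] = ['tag' => $tag, 'text' => trim($el->textContent)];
        }
    }
    return $found;
}

$result = get_headings('<h2>A</h2><div><h1>B</h1></div>');
```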
This is an old question, but since there are no newer answers, I wrote this with PHP's built-in DOM parser.