How can I extract heading tags from a PHP string?

Posted on 2024-08-17 16:09:56

From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable?

I would like to capture all of the text from these elements and store them in a new variable as comma-delimited values.

Is it possible using preg_match_all()?

Comments (8)

迟月 2024-08-24 16:09:56

First you need to clean up the HTML ($html_str in the example) with tidy:

$tidy_config = array(
    "indent"               => true,
    "output-xml"           => true,
    "output-xhtml"         => false,
    "drop-empty-paras"     => false,
    "hide-comments"        => true,
    "numeric-entities"     => true,
    "doctype"              => "omit",
    "char-encoding"        => "utf8",
    "repeated-attributes"  => "keep-last"
);

$xml_str = tidy_repair_string($html_str, $tidy_config);

Then you can load the XML ($xml_str) into a DOMDocument:

// loadXML() is an instance method; calling it statically fails on PHP 8
$doc = new DOMDocument();
$doc->loadXML($xml_str);

And finally you can use Horia Dragomir's method:

$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}

Alternatively, you can use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):

$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
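
To tie this back to the original question, the XPath approach can be extended to cover every heading level and to join the results with commas. The following is only a minimal sketch under some assumptions: it loads the HTML directly with loadHTML() instead of running tidy first, and the function name headings_to_csv and the sample markup are hypothetical, not part of the answer above.

```php
<?php
// Hypothetical helper: collect the text of all h1-h6 elements and join it
// into one comma-delimited string.
function headings_to_csv(string $html): string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; buffer parser warnings internally
    // instead of letting them be emitted, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // One location step that accepts any heading level, so the nodes
    // come back in document order.
    $nodes = $xpath->query(
        '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]'
    );

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }

    return implode(', ', $texts);
}

echo headings_to_csv('<h1>Intro</h1><p>Text</p><h2>Details</h2>');
// prints: Intro, Details
```

Note that heading text containing a comma would need quoting or a different separator if the string is to be split again later.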
沙与沫 2024-08-24 16:09:56

You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:

if (preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $str, $matches)) {
    // $matches[1] contains the text of every h1-h6 heading
}
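
As a rough illustration of how the captured text could be turned into the comma-delimited value the question asks for, here is a small sketch. The sample string is hypothetical, and the pattern only handles attribute-free headings without nested markup, which is the "really simple scenario" mentioned above.

```php
<?php
// Hypothetical sample input; real input would be your own HTML string.
$str = '<h1>First</h1><p>body</p><h2> Second </h2><h3>Third</h3>';

$csv = '';
// Matches only h1-h6 tags with no attributes and no nested tags inside.
if (preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $str, $matches)) {
    // $matches[0] holds the full tags, $matches[1] only the captured text.
    $csv = implode(',', array_map('trim', $matches[1]));
}

echo $csv; // prints: First,Second,Third
```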
眼眸里的那抹悲凉 2024-08-24 16:09:56

I know this is a very old post, but I wanted to share the best way I found to grab heading tags collectively.

<h1>title</h1> and <h2>title 2</h2>

This pattern works in a general regex tester, but PHP needs a small change, because the unescaped / in </ clashes with the / delimiter:

/<\s*h[1-2](?:.*)>(.*)</\s*h/i

Use this in your preg_match instead:

|<\s*h[1-2](?:.*)>(.*)</\s*h|Ui

$group[1] contains whatever is between the heading tags.
$group[0] is the entire match, e.g. <h1>test</h

This accounts for spaces, and also for the case where someone adds a class or id:

<h1 class="classname">test</h1>

The class/id part (the non-capturing group) is ignored.

NOTE: When I analyze HTML tags, I always replace all whitespace (line breaks, tabs, etc.) with a single space. This avoids having to deal with multi-line input and the dotall modifier, and it removes the large runs of whitespace that can interfere with regex matching in some cases.

  • Of course, I am only grabbing h1 and h2 here; change the range to [1-6] to grab all heading levels.
  • If anyone has a modification or a fix for my code, please respond. I'd really like to know.
  • As for regex being a bad fit for HTML: that is very much open to debate. If you design your PHP functions and regular expressions to strip away the junk and prepare the HTML for your specific expressions, you will be perfectly able to grab what you are looking for. You can build enough regex functions to replace amateur HTML handling.

Here is a link to the test page regex test
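
The whitespace normalization and attribute handling described in this answer might be sketched as follows. This is a variation under stated assumptions, not the author's exact expression: it uses [^>]* to skip attributes, a lazy (.*?) instead of the U modifier, and a backreference so the closing tag must match the opening level. The function name extract_heading_text and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper combining the two ideas from this answer:
// 1) squash every whitespace run to one space, 2) match headings that
// may carry attributes such as class or id.
function extract_heading_text(string $html): array
{
    // Collapse newlines, tabs and repeated spaces so no heading spans lines.
    $flat = preg_replace('/\s+/', ' ', $html);

    // '|' is the delimiter, so '/' needs no escaping.
    // \b[^>]* skips any attributes; \1 forces the closing tag to use the
    // same heading level as the opening tag.
    preg_match_all('|<\s*h([1-6])\b[^>]*>(.*?)<\s*/\s*h\1\s*>|i', $flat, $m);

    // Strip nested inline markup and surrounding spaces from each capture.
    return array_map(function ($text) {
        return trim(strip_tags($text));
    }, $m[2]);
}

$html = "<h1 class=\"classname\">test</h1>\n<h2\n id=\"x\">second\n  <em>title</em></h2>";
echo implode(',', extract_heading_text($html)); // prints: test,second title
```

The backreference is a design choice: it prevents an <h1> from being closed by a stray </h2>, at the cost of silently skipping mismatched tags.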

甜心 2024-08-24 16:09:56

If you actually want to use regular expressions, I think that:

preg_match_all('/<h[1-6]>(.*?)<\/h[1-6]>/i', $string, $matches);

should work as long as your header tags are not nested. As others have said, if you're not in control of the HTML, regular expressions are not a great way to do this.

浅笑轻吟梦一曲 2024-08-24 16:09:56

It is recommended not to use regex for this job and to use something like the SimpleHTMLDOM parser instead.

画尸师 2024-08-24 16:09:56

Please also consider PHP's native DOMDocument class.

You can use $domdoc->getElementsByTagName('h1') to get your headings.
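
A slightly fuller sketch of this idea, looping over all six heading levels, could look like the following. The sample markup is hypothetical, and the libxml_use_internal_errors() calls are an addition that simply keeps sloppy HTML from emitting parser warnings.

```php
<?php
$html = '<h2>Setup</h2><h1>Guide</h1><h2>Usage</h2>'; // hypothetical sample input

$domdoc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // buffer warnings on invalid HTML
$domdoc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$headings = [];
foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
    foreach ($domdoc->getElementsByTagName($tag) as $node) {
        $headings[] = trim($node->nodeValue);
    }
}

// Results are grouped by heading level, not by position in the document.
echo implode(',', $headings); // prints: Guide,Setup,Usage
```

Because getElementsByTagName() is called once per tag name, all h1 text comes before any h2 text; if document order matters, a single XPath query is the better fit.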

橙幽之幻 2024-08-24 16:09:56

I just want to share my solution:

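function get_all_headings( $content ) {
    // Lazy (.*?) stops at the first closing tag; \1 requires the closing tag
    // to match the opening level; the s flag lets a heading span several lines.
    preg_match_all( '/<(h[1-6])>(.*?)<\/\1>/is', $content, $matches );

    $r = array();
    if( !empty( $matches[1] ) && !empty( $matches[2] ) ){
        $tags = $matches[1];   // tag names, e.g. 'h1'
        $titles = $matches[2]; // text between the tags
        foreach ($tags as $i => $tag) {
            $r[] = array( 'tag' => $tag, 'title' => $titles[ $i ] );
        }
    }

    return $r;
}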

This function returns an empty array if no headings are found; otherwise it returns something like this:

array (
    array (
        'tag' => 'h1',
        'title' => 'This is a title',
    ),
    array (
        'tag' => 'h2',
        'title' => 'This is the second title',
    ),
)
夜吻♂芭芘 2024-08-24 16:09:56

This is an old question, but since there are no newer answers, here is one I wrote with PHP's built-in DOM parser.

$dom = new DOMDocument();
$dom->loadHTML("your html string here..");
$h2s = $dom->getElementsByTagName('h2');

// Print the text content of every <h2>
foreach ($h2s as $h2)
{
  echo $h2->nodeValue;
}