How do I extract heading tags from a PHP string?

Posted on 2024-08-17 16:09:56

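From a string that contains a lot of HTML, how can I extract all the text inside heading tags (<h1>, <h2>, etc.) into a new variable?

I would like to capture all the text inside these elements and store it in a new variable as comma-separated values.

Is this possible using preg_match_all()?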

迟月 2024-08-24 16:09:56

First you need to clean up the HTML ($html_str in the example) with tidy:

$tidy_config = array(
    "indent"               => true,
    "output-xml"           => true,
    "output-xhtml"         => false,
    "drop-empty-paras"     => false,
    "hide-comments"        => true,
    "numeric-entities"     => true,
    "doctype"              => "omit",
    "char-encoding"        => "utf8",
    "repeated-attributes"  => "keep-last"
);

$xml_str = tidy_repair_string($html_str, $tidy_config);

Then you can load the XML ($xml_str) into a DOMDocument:

$doc = new DOMDocument();
$doc->loadXML($xml_str);

And finally you can use Horia Dragomir's method:

$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}

Or you could use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):

$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
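
The question asks for a comma-separated string, so here is a minimal sketch of how the DOM and XPath steps above could be combined to produce one. It assumes the tidy extension is not available and falls back to DOMDocument::loadHTML(), which tolerates sloppy markup. The helper name extract_headings_csv() is hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper: returns the text of all h1-h6 elements,
// in document order, joined with commas.
function extract_headings_csv(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input on PHP 8
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // An XPath union returns the matching nodes in document order.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6');

    $titles = [];
    foreach ($nodes as $node) {
        // textContent also includes text from nested inline elements.
        $titles[] = trim($node->textContent);
    }

    return implode(',', $titles);
}

echo extract_headings_csv('<h1>Intro</h1><p>text</p><h2 class="sub">Details <em>here</em></h2>'), "\n";
```

If a heading can itself contain a comma, a plain implode() becomes ambiguous, so returning the array may be the safer choice in that case.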

沙与沫 2024-08-24 16:09:56

You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:

if (preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $str, $matches)) {
    // $matches[1] contains the text of every h1-h6 heading
}
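
As a rough sketch of how the captured text could be turned into the comma-separated string the question asks for, the snippet below wraps a similar pattern in a function. The name headings_to_csv() is hypothetical, and the pattern is loosened slightly (an assumption on my part) so that attributes on the opening tag are tolerated.

```php
<?php
// Hypothetical helper: regex-only extraction for simple markup.
function headings_to_csv(string $str): string
{
    // [1-6] limits matching to real heading levels; \b[^>]* allows attributes.
    // ([^<]*) only matches plain text, so headings with inner tags are skipped.
    $pattern = '/<h[1-6]\b[^>]*>([^<]*)<\/h[1-6]>/i';

    if (!preg_match_all($pattern, $str, $matches)) {
        return ''; // no headings found (or the pattern failed)
    }

    // $matches[1] holds the first capture group of every match.
    return implode(',', array_map('trim', $matches[1]));
}

echo headings_to_csv('<h1>First</h1><p>x</p><h2 id="b"> Second </h2>'), "\n";
```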

眼眸里的那抹悲凉 2024-08-24 16:09:56

I know this is a very old post, but I wanted to share the best way I found to grab heading tags collectively, such as:

<h1>title</h1> and <h2>title 2</h2>

This pattern works as a generic regex, but PHP behaves a bit differently, because the unescaped "/" in the closing tag would clash with "/" delimiters:

/<\s*h[1-2](?:.*)>(.*)</\s*h/i

In PHP, use this in your preg_match instead (note the "|" delimiters and the U flag, which makes the quantifiers ungreedy):

|<\s*h[1-2](?:.*)>(.*)</\s*h|Ui

$group[1] will contain whatever is between the heading tags.
$group[0] is the entire match, e.g. <h1>test</h

This accounts for whitespace, and if someone adds a class or id attribute, as in

<h1 class="classname">test</h1>

the attribute part (the non-capturing group) is ignored.

NOTE: When I analyze HTML tags, I always replace all whitespace (line breaks, tabs, etc.) with a single space first. This avoids multi-line and dotall issues, as well as very large runs of whitespace, which in some cases can break a regex.

Of course, I am only grabbing h1 and h2 here; change [1-2] to [1-6] to grab all heading levels.
If anyone has an improvement or a fix for my code, please respond; I'd really like to know.
As for the claim that regex is bad with HTML, that is very much open to debate. If you design your PHP functions and regular expressions to strip away the junk and prepare the HTML for your specific patterns, you will be able to grab exactly what you are looking for. You can write enough regex functions to replace amateur HTML work.

Here is a link to the test page: regex test
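
To illustrate the whitespace-normalization step from the note above, here is a small sketch. The function name normalized_headings() is hypothetical, and the pattern is a tightened variant of the one in the answer (it uses [^>]* for attributes and matches the full closing tag), so treat it as one possible reading rather than the author's exact code.

```php
<?php
// Hypothetical helper: flatten whitespace, then match headings.
function normalized_headings(string $html): array
{
    // Collapse every run of spaces, tabs and line breaks into one space.
    $flat = preg_replace('/\s+/', ' ', $html) ?? $html;

    // "|" is the delimiter, so the "/" in closing tags needs no escaping.
    // [^>]* skips attributes such as class or id; (.*?) is lazy so each
    // heading is matched on its own.
    preg_match_all('|<\s*h[1-6]\b[^>]*>(.*?)</\s*h[1-6]\s*>|i', $flat, $group);

    // $group[1] holds the text between the opening and closing tags.
    return array_map('trim', $group[1]);
}

print_r(normalized_headings("<h1\n class=\"classname\">test\n\ttitle</h1>\n<h2>title 2</h2>"));
```

Because the input is flattened first, a heading that spans several lines is matched without needing the s (dotall) flag.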

甜心 2024-08-24 16:09:56

If you actually want to use regular expressions, I think that:

preg_match_all('/<h[1-6]>(.*?)<\/h[1-6]>/is', $string, $matches);

should work as long as your header tags are not nested. As others have said, if you're not in control of the HTML, regular expressions are not a great way to do this.
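
Headings often contain inline markup such as links, which a plain-text capture would return as raw HTML. The sketch below shows one way this might be handled with a lazy capture plus strip_tags(). The helper name heading_texts() is hypothetical, and the approach still assumes headings are not nested inside each other.

```php
<?php
// Hypothetical helper: tolerate inline markup inside headings.
function heading_texts(string $string): array
{
    // (.*?) is lazy, and \1 requires the closing tag to have the same
    // level as the opening tag; the s flag lets "." match line breaks.
    preg_match_all('/<h([1-6])\b[^>]*>(.*?)<\/h\1>/is', $string, $matches);

    // strip_tags() removes inner tags such as <a> or <span>.
    return array_map(
        static fn(string $inner): string => trim(strip_tags($inner)),
        $matches[2]
    );
}

print_r(heading_texts('<h2><a href="#x">Linked</a> title</h2><h3>Plain</h3>'));
```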

浅笑轻吟梦一曲 2024-08-24 16:09:56

I would recommend not using regex for this job and using the SimpleHTMLDOM parser instead.

画尸师 2024-08-24 16:09:56

Also consider PHP's native DOMDocument class.

You can use $domdoc->getElementsByTagName('h1') to get your headings.
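
A minimal sketch of how getElementsByTagName() could be applied to every heading level is shown below. The helper name headings_by_level() is hypothetical, and grouping the results by tag is just one possible design, chosen because it falls out naturally from calling the method once per level.

```php
<?php
// Hypothetical helper: heading text grouped by tag name.
function headings_by_level(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input on PHP 8
    }

    $domdoc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $domdoc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
        foreach ($domdoc->getElementsByTagName($tag) as $node) {
            $result[$tag][] = trim($node->textContent);
        }
    }

    return $result; // grouped by level, not in document order
}

print_r(headings_by_level('<h2>B</h2><h1>A</h1><h2>C</h2>'));
```

If document order matters, an XPath query over all heading levels is probably a better fit than one call per tag.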

橙幽之幻 2024-08-24 16:09:56

I just want to share my solution:

function get_all_headings( $content ) {
    // Lazy (.*?) stops at the first closing tag; \1 makes the closing tag
    // match the opening level. The s flag lets a heading span several lines.
    preg_match_all( '/<(h[1-6])>(.*?)<\/\1>/is', $content, $matches );

    $r = array();
    if( !empty( $matches[1] ) && !empty( $matches[2] ) ){
        $tags = $matches[1];
        $titles = $matches[2];
        foreach ($tags as $i => $tag) {
            $r[] = array( 'tag' => $tag, 'title' => $titles[ $i ] );
        }
    }

    return $r;
}

This function returns an empty array if no headings are found; otherwise it returns something like this:

array (
    array (
        'tag' => 'h1',
        'title' => 'This is a title',
    ),
    array (
        'tag' => 'h2',
        'title' => 'This is the second title',
    ),
)

夜吻♂芭芘 2024-08-24 16:09:56

This is an old question, but since there are no newer answers: I wrote this with PHP's built-in DOM parser.

$dom = new DOMDocument();
$dom->loadHTML("your html string here..");
$h2s = $dom->getElementsByTagName('h2');

foreach ($h2s as $h2)
{
  echo $h2->nodeValue;
}

About the author

世俗缘
