Limiting the number of results with preg_match_all in PHP

Posted on 2024-10-08 06:10:31

Is there any way to limit the number of matches that will be returned using preg_match_all?

So for example, I want to match only the first 20 <p> tags on a web page but there are 100 <p> tags.

Cheers

Comments (8)

℉服软 2024-10-15 06:10:31
$matches = array();
// PREG_SET_ORDER stores one entry per match, so slicing limits the matches themselves
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
$twenty = array_slice($matches, 0, 20);
居里长安 2024-10-15 06:10:31

Just match all and slice the resulting array:

$allMatches = array ();
$numMatches = preg_match_all($pattern, $subject, $allMatches, PREG_SET_ORDER);
$limit = 20;
$limitedResults = $allMatches;
if($numMatches > $limit)
{
   $limitedResults = array_slice($allMatches, 0, $limit);
}

// Use $limitedResults here
没企图 2024-10-15 06:10:31

No, the computation of the preg_match_all result set cannot be limited. You can only limit the results afterwards with array_slice or array_splice (this would require PREG_SET_ORDER):

preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
$firstMatches = array_slice($matches, 0, 20);
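
To see why the slice needs PREG_SET_ORDER, here is a minimal sketch. The sample `$subject` and `$pattern` are made up for illustration and are not from the question. With the default PREG_PATTERN_ORDER the outer array is indexed by capture group, so slicing it trims groups rather than matches.

```php
<?php
// Hypothetical sample input with three paragraphs.
$subject = '<p>a</p><p>b</p><p>c</p>';
$pattern = '~<p>(.*?)</p>~';

// Default order: $byPattern[0] holds all full matches, $byPattern[1] all group-1 captures.
preg_match_all($pattern, $subject, $byPattern);
// Set order: one entry per match, each holding the full match and its groups.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);

$slicedPattern = array_slice($byPattern, 0, 2); // still 3 matches inside each group
$slicedSet     = array_slice($bySet, 0, 2);     // exactly 2 matches

echo count($slicedPattern[0]), "\n";
echo count($slicedSet), "\n";
```

If you prefer the default order, slicing `$byPattern[0]` instead of the outer array would also limit the full matches.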

Besides that, you shouldn't use regular expressions to parse HTML anyway. Although modern regular expression engines are no longer strictly regular and can process an irregular language like HTML, doing so is too error-prone. It is better to use a proper HTML parser, such as the one in PHP's DOM library. Then just use a counter to collect at most 20 matches:

$doc = new DOMDocument();
$doc->loadHTML($code);
$counter = 20;
$matches = array();
foreach ($doc->getElementsByTagName('p') as $elem) {
    if ($counter-- <= 0) {
        break;
    }
    $matches[] = $elem;
}
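
A self-contained variant of the same idea, as a rough sketch: the sample `$html` and the `$limit` of 3 are assumptions for illustration. It indexes the node list with `item()` instead of counting down inside a `foreach`.

```php
<?php
// Hypothetical input: five paragraphs.
$html = '<html><body>' . str_repeat('<p>x</p>', 5) . '</body></html>';

// Real-world markup is often invalid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$nodes = $doc->getElementsByTagName('p');
$limit = 3;
$first = [];
// Stop at the limit or at the end of the list, whichever comes first.
for ($i = 0, $n = min($limit, $nodes->length); $i < $n; $i++) {
    $first[] = $nodes->item($i);
}

echo count($first), "\n";
```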
泪是无色的血 2024-10-15 06:10:31

You can use the T-Regx library:

pattern('<p>')->match($yourHtml)->only(20);
昇り龍 2024-10-15 06:10:31

To expand on @Gumbo's great advice to use a DOM parser instead of regex, the following snippet uses an XPath query with a position() condition to limit the targeted tags.

Code: (Demo targeting 4 of 5 p tags)

$html = <<<HTML
<div>
    <p class="classy">1
</p>
    <p>2</p>
    <p data-p="<p>notatag</p>">3</p>
    <span data-monkeywrench='<p'>z</span>
    <p
 data-p="<p>notatag</p>">4</p>
    <p>5</p>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Parenthesize the path so position() counts across the whole document, not per parent element
foreach ($xpath->query('(//p)[position() <= 4]') as $p) {
    echo var_export($p->nodeValue, true) , "\n---\n";
}

Output:

'1
'
---
'2'
---
'3'
---
'4'
---
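
One caveat worth checking when the paragraphs sit under several parents: a predicate attached directly to `//p` is evaluated per parent element, while wrapping the path in parentheses applies it to the whole node set in document order. A small sketch with a made-up two-`<div>` sample shows the difference:

```php
<?php
// Hypothetical sample: four paragraphs split across two parent <div>s.
$html = '<div><p>a</p><p>b</p></div><div><p>c</p><p>d</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// position() is relative to each parent: up to 2 <p> children per <div>.
$perParent = [];
foreach ($xpath->query('//p[position() <= 2]') as $p) {
    $perParent[] = $p->nodeValue;
}

// position() is relative to the combined node set: the first 2 <p> overall.
$firstTwo = [];
foreach ($xpath->query('(//p)[position() <= 2]') as $p) {
    $firstTwo[] = $p->nodeValue;
}

echo implode(',', $perParent), "\n"; // a,b,c,d
echo implode(',', $firstTwo), "\n";  // a,b
```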
塔塔猫 2024-10-15 06:10:31

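This is the true answer and the most memory-efficient way: the matching itself stops early.
Use preg_replace_callback() instead and collect the matches through a variable captured by reference. Its limit argument (20 here) caps the number of replacements, so the callback runs at most 20 times.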

<?php

$matches = [];

preg_replace_callback(
    '~<p(?:\s.*?)?>(?:.*?)</p>~s',
    function (array $match) use (&$matches) {
        $matches[] = $match[0];
    },
    $html,
    20,
    $_
);

var_dump($matches);
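
The same technique can be wrapped in a small helper. This is only a sketch: `firstMatches()` is a hypothetical name, not a PHP built-in, and the sample `$html` is made up. The callback returns the matched text unchanged because the replaced string is discarded anyway.

```php
<?php
// Hypothetical helper: collect at most $limit full matches of $pattern.
function firstMatches(string $pattern, string $subject, int $limit): array
{
    $found = [];
    preg_replace_callback(
        $pattern,
        function (array $m) use (&$found) {
            $found[] = $m[0];
            return $m[0]; // keep the text as-is; the result is thrown away
        },
        $subject,
        $limit // preg_replace_callback() stops after this many replacements
    );
    return $found;
}

$html = '<p>a</p><p class="x">b</p><p>c</p>';
$firstTwo = firstMatches('~<p(?:\s.*?)?>(?:.*?)</p>~s', $html, 2);
var_dump($firstTwo);
```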
梦里南柯 2024-10-15 06:10:31

You can either use preg_match_all() and discard the matches you're not interested in, or you can use a loop with preg_match(). The second option is better if you're concerned about the expense of scanning a large string.

This example stops after 2 matches, although the entire string actually contains 4:

<?php

$str = "ab1ab2ab3ab4c";

for ($offset = 0, $n = 0;
        $n < 2 && preg_match('/b([0-9])/', $str, $matches, PREG_OFFSET_CAPTURE, $offset);
        ++$n, $offset = $matches[0][1] + 1) {

        var_dump($matches);
}

On reflection, a while loop would probably have been clearer than a for loop ;)
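
For comparison, a while-loop version might look like the sketch below. The names `$limit` and `$found` are introduced here for illustration. It advances the offset past the end of each match, which assumes the pattern never matches an empty string.

```php
<?php
$str = "ab1ab2ab3ab4c";
$limit = 2;   // maximum number of matches to collect
$offset = 0;  // byte offset where the next search starts
$found = [];

while (count($found) < $limit
    && preg_match('/b([0-9])/', $str, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
    // With PREG_OFFSET_CAPTURE each entry is [text, offset].
    $found[] = $m[1][0];
    // Continue just past the end of the current match.
    $offset = $m[0][1] + strlen($m[0][0]);
}

var_dump($found);
```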

︶ ̄淡然 2024-10-15 06:10:31

I don't think so, but preg_match does have an offset parameter, and also a PREG_OFFSET_CAPTURE flag which, when combined, can be used to get the "next match".

This is mainly useful if you don't want to get all results and then array_slice() a portion off :o)

EDIT:
Ok, here's some code (not tested or used in any way):

$offset = 0;
$matches = array();
for ($i = 0; $i < 20; $i++) {
    // preg_match() returns 1 on a match and fills $m with [text, offset] pairs
    if (preg_match('/<p(?:.*?)>/', $string, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
        break;
    }
    $matches[] = $m[0][0];
    // Resume the search just past the end of this match
    $offset = $m[0][1] + strlen($m[0][0]);
}