Preg_replace的效率

发布于 2024-09-14 19:47:57 字数 3793 浏览 4 评论 0原文

执行摘要：

preg_replace() 比字符串比较运行得更快。为什么？正则表达式不应该慢一些吗？

在最近的问题中关于检测给定输入中任何不允许的子字符串数组，我建议将 preg_replace() 调用的结果与原始输入进行比较，因为 preg_replace() code> 可以将模式数组作为输入。因此，我的方法可能是单个 if，而其他解决方案需要一个（或多个）循环。

我对争论我的答案不感兴趣，因为实际上它比循环的可读性/可维护性更差。我的答案仍然是-1，为了可读性/易于维护，我会接受这一点，但我的方法指出的最大错误是缺乏效率。这引起了我的好奇，并促使我做了一些测试。我的结果让我有点惊讶：在所有其他因素相同的情况下，preg_replace() 比任何其他方法更快。

你能解释一下为什么会出现这种情况吗？

下面可以找到我的这些测试代码以及结果

$input = "In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a `preg_replace()` call to the original input, since `preg_replace()` can take an array of patterns as input. Thus my method for this could be a single `if` whereas the other solutions required one (or many) loops. I'm not interested in debating my answer, because really it is less readable/maintainable than the loops. However, the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, `preg_replace()` was **faster** than any of the other methods. Can you explain why this was the case?";
$input2 = "Short sentence - no matches";
$input3 = "Word";
$input4 = "Short sentence - matches loop";

$start1 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$p_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (str_check($rejectedStrs, $input)) $p_matches++;
    if (str_check($rejectedStrs, $input2)) $p_matches++;
    if (str_check($rejectedStrs, $input3)) $p_matches++;
    if (str_check($rejectedStrs, $input4)) $p_matches++;
}

$start2 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$l_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (loop_check($rejectedStrs, $input)) $l_matches++;
    if (loop_check($rejectedStrs, $input2)) $l_matches++;
    if (loop_check($rejectedStrs, $input3)) $l_matches++;
    if (loop_check($rejectedStrs, $input4)) $l_matches++;
}

$start3 = microtime(true);
$rejectedStrs = array("/loop/", "/efficiency/", "/explain/");
$s_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (preg_check($rejectedStrs, $input)) $s_matches++;
    if (preg_check($rejectedStrs, $input2)) $s_matches++;
    if (preg_check($rejectedStrs, $input3)) $s_matches++;
    if (preg_check($rejectedStrs, $input4)) $s_matches++;
}

$end = microtime(true);
echo $p_matches." ".$l_matches." ".$s_matches."\n";
echo "str_match: ".$start1." ".$start2."= ".($start2-$start1)."\nloop_match: ".$start2." ".$start3."=".($start3-$start2)."\npreg_match: ".$start3." ".$end."=".($end-$start3);

function preg_check($rejectedStrs, $input) {
    if($input == preg_replace($rejectedStrs, "", $input)) 
        return true;
    return false;
}

function loop_check($badwords, $string) {

    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) {
                return false;
            }
        }
    }
    return true;
}

function str_check($badwords, $str) {
    foreach ($badwords as $word) {
        if (stripos(" $str ", " $word ") !== false) {
            return false;
        }
    }
    return true;
}

：

20000 20000 20000
str_match: 1282270516.6934 1282270518.5881= 1.894730091095
循环匹配：1282270518.5881 1282270523.0943=4.5061857700348
preg_match：1282270523.0943 1282270523.6191=0.52475500106812

原文

Executive Summary:

preg_replace() ran faster than string comparisons. Why? Shouldn't regular expressions be slower?

In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a preg_replace() call to the original input, since preg_replace() can take an array of patterns as input. Thus my method for this could be a single if whereas the other solutions required one (or many) loops.

I'm not interested in debating my answer, because really it is less readable/maintainable than the loops. My answer there still holds a -1, and I'll accept that for readability/ease of maintenance, but the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, preg_replace() was faster than any of the other methods.

Can you explain why this was the case?

My code for these tests can be found below, along with the results:

$input = "In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a `preg_replace()` call to the original input, since `preg_replace()` can take an array of patterns as input. Thus my method for this could be a single `if` whereas the other solutions required one (or many) loops. I'm not interested in debating my answer, because really it is less readable/maintainable than the loops. However, the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, `preg_replace()` was **faster** than any of the other methods. Can you explain why this was the case?";
$input2 = "Short sentence - no matches";
$input3 = "Word";
$input4 = "Short sentence - matches loop";

$start1 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$p_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (str_check($rejectedStrs, $input)) $p_matches++;
    if (str_check($rejectedStrs, $input2)) $p_matches++;
    if (str_check($rejectedStrs, $input3)) $p_matches++;
    if (str_check($rejectedStrs, $input4)) $p_matches++;
}

$start2 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$l_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (loop_check($rejectedStrs, $input)) $l_matches++;
    if (loop_check($rejectedStrs, $input2)) $l_matches++;
    if (loop_check($rejectedStrs, $input3)) $l_matches++;
    if (loop_check($rejectedStrs, $input4)) $l_matches++;
}

$start3 = microtime(true);
$rejectedStrs = array("/loop/", "/efficiency/", "/explain/");
$s_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (preg_check($rejectedStrs, $input)) $s_matches++;
    if (preg_check($rejectedStrs, $input2)) $s_matches++;
    if (preg_check($rejectedStrs, $input3)) $s_matches++;
    if (preg_check($rejectedStrs, $input4)) $s_matches++;
}

$end = microtime(true);
echo $p_matches." ".$l_matches." ".$s_matches."\n";
echo "str_match: ".$start1." ".$start2."= ".($start2-$start1)."\nloop_match: ".$start2." ".$start3."=".($start3-$start2)."\npreg_match: ".$start3." ".$end."=".($end-$start3);

function preg_check($rejectedStrs, $input) {
    if($input == preg_replace($rejectedStrs, "", $input)) 
        return true;
    return false;
}

function loop_check($badwords, $string) {

    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) {
                return false;
            }
        }
    }
    return true;
}

function str_check($badwords, $str) {
    foreach ($badwords as $word) {
        if (stripos(" $str ", " $word ") !== false) {
            return false;
        }
    }
    return true;
}

Results

20000 20000 20000
str_match: 1282270516.6934 1282270518.5881= 1.894730091095
loop_match: 1282270518.5881 1282270523.0943=4.5061857700348
preg_match: 1282270523.0943 1282270523.6191=0.52475500106812

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

晒暮凉 2024-09-21 19:47:57

我们首先看一下 preg_check 和 loop_check。他们都必须遍历整个字符串，并且必须检查每次遍历中的每个单词。所以他们的行为至少是O(n*m)，其中n是字符串的长度，m是坏词的数量。您可以通过增加 n 和 m 的值来运行算法并绘制 3D 图形来测试这一点（但是，您可能需要也可能不需要以非常大的值来运行它） n 和 m 的值较高才能看到此行为）。

loop_check 在这里更（渐近）高效。原因是字符串的单词数量与其长度不成正比——我似乎记得它通常遵循对数函数。它可能使用哈希表来存储它通过这种方式找到的单词，这是在平均恒定时间内完成的（如果我们忽略我们可能必须不时重建哈希表以容纳更多元素）。

因此，loop_check 将具有类似于 n + m * log(n) 的渐近行为，这比 n*m 更好。

现在，这指的是算法的渐近行为，即，当m和n变得非常大（并且可能需要“非常非常”）时。对于较小的 m 和 n 值，常量起着很大的作用。特别是，执行 PHP 操作码和 PHP 函数调用比用 C 实现的相同任务成本更高，仅需要一次函数调用。这并不会使正则表达式算法更快，它只是使 m 和 n 的小值更快。

回复收藏 0 原文

绝不放开 2024-09-21 19:47:57

您能解释一下为什么会出现这种情况吗？

简单的。 preg_match是用C实现的。其他解决方案是用PHP实现的。现在，这并不意味着正则表达式总是比等效的 PHP 更快，但大多数时候，它可能会这样。

我最近遇到了类似的情况，我有一个函数（CamelCase 转换器）被调用数十次数千次，并占用相当多的 CPU（我分析过）。我尝试了所有我能想到的 PHP 重新实现。 preg_replace 总是更快。最后，我保留了该函数的原样，并记住了它，这起到了作用。

在许多情况下，执行的 PHP 语句越少越好。如果您可以通过对 C 底层实现的函数的一次调用来替换循环，那么这可能是您最好的选择。

确实，它比循环的可读性/可维护性低

。一句台词就这么简单。虽然我可能会选择更像的东西

function preg_check($rejectedStrs, $input) {
    return preg_match($rejectedStrs, "", $input);
}

Can you explain why this was the case?

Easy. preg_match is implemented in C. The other solutions are implemented in PHP. Now, that doesn't mean a regex will always be faster than the equivalent PHP, but most of the time, it probably will be.

I recently had a similar situation, where I had a function (a CamelCase converter) that was being called 10s of thousands of times, and taking a fair amount of CPU (I profiled). I tried every PHP reimplementation I could dream up. The preg_replace was always faster. In the end, I left the function as it was, and memoized it, which did the trick.

In many cases, the fewer PHP statements executed, the better. If you can replace a loop with a single call to a function that's implemented in C under the hood, that may be your best bet.

really it is less readable/maintainable than the loops

I disagree. One-liners are as simple as it gets. Although I'd probably go with something more like

function preg_check($rejectedStrs, $input) {
    return preg_match($rejectedStrs, "", $input);
}

回复收藏 0 原文

~没有更多了~

关于作者

旧街凉风

暂无简介

0 文章

0 评论

22 人气

关注发私信

友情链接

文江博客

Preg_replace的效率

执行摘要：

：

Executive Summary:

Results

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

留蓝

18790681156

zach7772

Wini

ayeshaaroy

初雪

友情链接

Preg_replace的效率

执行摘要：

：

Executive Summary:

Results

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

留蓝

18790681156

zach7772

Wini

ayeshaaroy

初雪

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。