Preg_replace的效率

发布于 2024-09-14 19:47:57 字数 3793 浏览 4 评论 0原文

执行摘要:

preg_replace() 比字符串比较运行得更快。为什么?正则表达式不应该慢一些吗?


最近的问题中 关于检测给定输入中任何不允许的子字符串数组,我建议将 preg_replace() 调用的结果与原始输入进行比较,因为 preg_replace() code> 可以将模式数组作为输入。因此,我的方法可能是单个 if,而其他解决方案需要一个(或多个)循环。

我对争论我的答案不感兴趣,因为实际上它比循环的可读性/可维护性更差。我的答案仍然是-1,为了可读性/易于维护,我会接受这一点,但我的方法指出的最大错误是缺乏效率。这引起了我的好奇,并促使我做了一些测试。我的结果让我有点惊讶:在所有其他因素相同的情况下,preg_replace() 比任何其他方法更快

你能解释一下为什么会出现这种情况吗?

下面可以找到我的这些测试代码以及​​结果

$input = "In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a `preg_replace()` call to the original input, since `preg_replace()` can take an array of patterns as input. Thus my method for this could be a single `if` whereas the other solutions required one (or many) loops. I'm not interested in debating my answer, because really it is less readable/maintainable than the loops. However, the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, `preg_replace()` was **faster** than any of the other methods. Can you explain why this was the case?";
$input2 = "Short sentence - no matches";
$input3 = "Word";
$input4 = "Short sentence - matches loop";

$start1 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$p_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (str_check($rejectedStrs, $input)) $p_matches++;
    if (str_check($rejectedStrs, $input2)) $p_matches++;
    if (str_check($rejectedStrs, $input3)) $p_matches++;
    if (str_check($rejectedStrs, $input4)) $p_matches++;
}

$start2 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$l_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (loop_check($rejectedStrs, $input)) $l_matches++;
    if (loop_check($rejectedStrs, $input2)) $l_matches++;
    if (loop_check($rejectedStrs, $input3)) $l_matches++;
    if (loop_check($rejectedStrs, $input4)) $l_matches++;
}

$start3 = microtime(true);
$rejectedStrs = array("/loop/", "/efficiency/", "/explain/");
$s_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (preg_check($rejectedStrs, $input)) $s_matches++;
    if (preg_check($rejectedStrs, $input2)) $s_matches++;
    if (preg_check($rejectedStrs, $input3)) $s_matches++;
    if (preg_check($rejectedStrs, $input4)) $s_matches++;
}

$end = microtime(true);
echo $p_matches." ".$l_matches." ".$s_matches."\n";
echo "str_match: ".$start1." ".$start2."= ".($start2-$start1)."\nloop_match: ".$start2." ".$start3."=".($start3-$start2)."\npreg_match: ".$start3." ".$end."=".($end-$start3);

function preg_check($rejectedStrs, $input) {
    if($input == preg_replace($rejectedStrs, "", $input)) 
        return true;
    return false;
}

function loop_check($badwords, $string) {

    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) {
                return false;
            }
        }
    }
    return true;
}

function str_check($badwords, $str) {
    foreach ($badwords as $word) {
        if (stripos(" $str ", " $word ") !== false) {
            return false;
        }
    }
    return true;
}

20000 20000 20000

str_match: 1282270516.6934 1282270518.5881= 1.894730091095

循环匹配:1282270518.5881 1282270523.0943=4.5061857700348

preg_match:1282270523.0943 1282270523.6191=0.52475500106812

Executive Summary:

preg_replace() ran faster than string comparisons. Why? Shouldn't regular expressions be slower?


In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a preg_replace() call to the original input, since preg_replace() can take an array of patterns as input. Thus my method for this could be a single if whereas the other solutions required one (or many) loops.

I'm not interested in debating my answer, because really it is less readable/maintainable than the loops. My answer there still holds a -1, and I'll accept that for readability/ease of maintenance, but the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, preg_replace() was faster than any of the other methods.

Can you explain why this was the case?

My code for these tests can be found below, along with the results:

$input = "In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a `preg_replace()` call to the original input, since `preg_replace()` can take an array of patterns as input. Thus my method for this could be a single `if` whereas the other solutions required one (or many) loops. I'm not interested in debating my answer, because really it is less readable/maintainable than the loops. However, the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, `preg_replace()` was **faster** than any of the other methods. Can you explain why this was the case?";
$input2 = "Short sentence - no matches";
$input3 = "Word";
$input4 = "Short sentence - matches loop";

$start1 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$p_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (str_check($rejectedStrs, $input)) $p_matches++;
    if (str_check($rejectedStrs, $input2)) $p_matches++;
    if (str_check($rejectedStrs, $input3)) $p_matches++;
    if (str_check($rejectedStrs, $input4)) $p_matches++;
}

$start2 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$l_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (loop_check($rejectedStrs, $input)) $l_matches++;
    if (loop_check($rejectedStrs, $input2)) $l_matches++;
    if (loop_check($rejectedStrs, $input3)) $l_matches++;
    if (loop_check($rejectedStrs, $input4)) $l_matches++;
}

$start3 = microtime(true);
$rejectedStrs = array("/loop/", "/efficiency/", "/explain/");
$s_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (preg_check($rejectedStrs, $input)) $s_matches++;
    if (preg_check($rejectedStrs, $input2)) $s_matches++;
    if (preg_check($rejectedStrs, $input3)) $s_matches++;
    if (preg_check($rejectedStrs, $input4)) $s_matches++;
}

$end = microtime(true);
echo $p_matches." ".$l_matches." ".$s_matches."\n";
echo "str_match: ".$start1." ".$start2."= ".($start2-$start1)."\nloop_match: ".$start2." ".$start3."=".($start3-$start2)."\npreg_match: ".$start3." ".$end."=".($end-$start3);

function preg_check($rejectedStrs, $input) {
    if($input == preg_replace($rejectedStrs, "", $input)) 
        return true;
    return false;
}

function loop_check($badwords, $string) {

    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) {
                return false;
            }
        }
    }
    return true;
}

function str_check($badwords, $str) {
    foreach ($badwords as $word) {
        if (stripos(" $str ", " $word ") !== false) {
            return false;
        }
    }
    return true;
}

Results

20000 20000 20000

str_match: 1282270516.6934 1282270518.5881= 1.894730091095

loop_match: 1282270518.5881 1282270523.0943=4.5061857700348

preg_match: 1282270523.0943 1282270523.6191=0.52475500106812

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

晒暮凉 2024-09-21 19:47:57

我们首先看一下 preg_checkloop_check。他们都必须遍历整个字符串,并且必须检查每次遍历中的每个单词。所以他们的行为至少是O(n*m),其中n是字符串的长度,m是坏词的数量。您可以通过增加 nm 的值来运行算法并绘制 3D 图形来测试这一点(但是,您可能需要也可能不需要以非常大的值来运行它) nm 的值较高才能看到此行为)。

loop_check 在这里(渐近)高效。原因是字符串的单词数量与其长度不成正比——我似乎记得它通常遵循对数函数。它可能使用哈希表来存储它通过这种方式找到的单词,这是在平均恒定时间内完成的(如果我们忽略我们可能必须不时重建哈希表以容纳更多元素)。

因此,loop_check 将具有类似于 n + m * log(n) 的渐近行为,这比 n*m 更好。

现在,这指的是算法的渐近行为,即,当mn变得非常大(并且可能需要“非常非常”)时。对于较小的 mn 值,常量起着很大的作用。特别是,执行 PHP 操作码和 PHP 函数调用比用 C 实现的相同任务成本更高,仅需要一次函数调用。这并不会使正则表达式算法更快,它只是使 mn 的小值更快。

Let's first look at preg_check and loop_check. Both of them will have to traverse the entire string, and they will have to check each of the individual words in each traversal. So their behavior will at least be O(n*m), where n is the length of the string and m the number of bad words. You can test this by running the algorithm with increasing values of n and m and plotting the 3D graphs (however, you may, or may not, have to run it with very high values of n and m to see this behavior).

loop_check is more (asymptoticly) efficient here. The reason is that the number of words a string has is not proportional to their length -- I seem to recall it typically follows a logarithmic function. It probably uses a hash table to store the words it finds through the way, which is done in average constant time (if we ignore that we may have to rebuild the hash table from time to time to accommodate more elements).

Therefore loop_check will have an asymptotic behavior that follows something like n + m * log(n), which is better than n*m.

Now, this refers to the asymptotic behavior of the algorithms, i.e., when m and n grow very (and it may require "very very") large. For small values of m and n the constants play a big part. In particular, execution of PHP opcodes and PHP function calls are more costly than the same task implemented in C, just one function call away. This doesn't make the regex algorithm faster, it just makes it faster for small values of m and n.

绝不放开 2024-09-21 19:47:57

您能解释一下为什么会出现这种情况吗?

简单的。 preg_match是用C实现的。其他解决方案是用PHP实现的。现在,这并不意味着正则表达式总是比等效的 PHP 更快,但大多数时候,它可能会这样。

我最近遇到了类似的情况,我有一个函数(CamelCase 转换器)被调用数十次数千次,并占用相当多的 CPU(我分析过)。我尝试了所有我能想到的 PHP 重新实现。 preg_replace 总是更快。最后,我保留了该函数的原样,并记住了它,这起到了作用。

在许多情况下,执行的 PHP 语句越少越好。如果您可以通过对 C 底层实现的函数的一次调用来替换循环,那么这可能是您最好的选择。

确实,它比循环的可读性/可维护性低

。一句台词就这么简单。虽然我可能会选择更像的东西

function preg_check($rejectedStrs, $input) {
    return preg_match($rejectedStrs, "", $input);
}

Can you explain why this was the case?

Easy. preg_match is implemented in C. The other solutions are implemented in PHP. Now, that doesn't mean a regex will always be faster than the equivalent PHP, but most of the time, it probably will be.

I recently had a similar situation, where I had a function (a CamelCase converter) that was being called 10s of thousands of times, and taking a fair amount of CPU (I profiled). I tried every PHP reimplementation I could dream up. The preg_replace was always faster. In the end, I left the function as it was, and memoized it, which did the trick.

In many cases, the fewer PHP statements executed, the better. If you can replace a loop with a single call to a function that's implemented in C under the hood, that may be your best bet.

really it is less readable/maintainable than the loops

I disagree. One-liners are as simple as it gets. Although I'd probably go with something more like

function preg_check($rejectedStrs, $input) {
    return preg_match($rejectedStrs, "", $input);
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文