PHP arrays - removing duplicates (time complexity)

Posted on 2024-07-12 03:39:52

Okay this is not a question of "how to get all uniques" or "How to remove duplicates from my array in php". This is a question about the time complexity.

I figured that the array_unique is somewhat O(n^2 - n) and here's my implementation:

function array_unique2($array) 
{ 
    $to_return = array(); 
    $current_index = 0;

    for ( $i = 0 ; $i < count($array); $i++ ) 
    { 
        $current_is_unique = true; 

        for ( $a = $i+1; $a < count($array); $a++ ) 
        { 
            if ( $array[$i] == $array[$a] ) 
            { 
                $current_is_unique = false; 
                break; 
            } 
        } 
        if ( $current_is_unique ) 
        { 
            $to_return[$current_index] = $array[$i];
            $current_index++; // advance, otherwise every unique value overwrites index 0
        } 

    } 

    return $to_return; 
}

However, when benchmarking this against array_unique I got the following result:

Testing (array_unique2)... Operation took 0.52146291732788 s.

Testing (array_unique)... Operation took 0.28323101997375 s.

That makes array_unique about twice as fast. My question is: why? (Both ran on the same random data.)

And a friend of mine wrote the following:

function array_unique2($a)
{
    $n = array();
    foreach ($a as $k=>$v)
        if (!in_array($v,$n))
            $n[$k]=$v;
    return $n;
}

which is twice as fast as the built-in one in PHP.

I'd like to know, why?

What is the time-complexity of array_unique and in_array?

Edit
I removed the count($array) from both loops and just used a variable in the top of the function, that gained 2 seconds on 100 000 elements!
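
A minimal sketch of what that edit describes, assuming the input is a list with consecutive integer keys starting at 0. The name array_unique_cached_count is made up for illustration; the only intended change from the function above is that count() is evaluated once instead of in every loop condition.

```php
<?php
// Hypothetical variant of the function above with count() hoisted out of the loops.
function array_unique_cached_count(array $array): array
{
    $to_return = [];
    $n = count($array); // evaluated once, not on every iteration

    for ($i = 0; $i < $n; $i++) {
        $current_is_unique = true;

        // Look for a later element with the same value.
        for ($a = $i + 1; $a < $n; $a++) {
            if ($array[$i] == $array[$a]) {
                $current_is_unique = false;
                break;
            }
        }

        if ($current_is_unique) {
            $to_return[] = $array[$i]; // append, so no manual index is needed
        }
    }

    return $to_return;
}

$result = array_unique_cached_count(['b', 'a', 'b', 'c', 'a']);
```

Note that this approach keeps the last occurrence of each value and renumbers the keys, so its output is not identical to what array_unique returns.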

Comments (8)

疯狂的代价 2024-07-19 03:39:52

While I can't speak for the native array_unique function, I can tell you that your friend's algorithm is faster because:

  1. He uses a single foreach loop as opposed to your double for() loop.
  2. Foreach loops tend to perform faster than for loops in PHP.
  3. He used a single if(! ) comparison while you used two if() structures
  4. The only additional function call your friend made was in_array whereas you called count() twice.
  5. You made three variable declarations that your friend didn't have to ($a, $current_is_unique, $current_index)

While none of these factors alone is huge, I can see where the cumulative effect would make your algorithm take longer than your friend's.

江南月 2024-07-19 03:39:52

The time complexity of in_array() is O(n). To see this, we'll take a look at the PHP source code.

The in_array() function is implemented in ext/standard/array.c. All it does is call php_search_array(), which contains the following loop:

while (zend_hash_get_current_data_ex(target_hash, (void **)&entry, &pos) == SUCCESS) {

    // checking the value...

    zend_hash_move_forward_ex(target_hash, &pos);
}

That's where the linear characteristic comes from.

This is the overall characteristic of the algorithm, because zend_hash_move_forward_ex() has constant behaviour: looking at Zend/zend_hash.c, we see that it's basically just

*current = (*current)->pListNext;
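
To make the linear behaviour concrete, here is a rough userland sketch, not the engine's code. linear_in_array and the $comparisons counter are hypothetical names added for illustration. It mimics the loose comparison that in_array() does by default and counts how many comparisons the in_array()-based unique loop needs when every value is distinct.

```php
<?php
// Hypothetical userland stand-in for in_array() that counts its comparisons.
function linear_in_array($needle, array $haystack, int &$comparisons): bool
{
    foreach ($haystack as $value) {
        $comparisons++;
        if ($value == $needle) { // loose comparison, like in_array() without $strict
            return true;
        }
    }
    return false;
}

$comparisons = 0;
$seen = [];

// Worst case for the in_array() approach: 100 values, all distinct.
foreach (range(1, 100) as $v) {
    if (!linear_in_array($v, $seen, $comparisons)) {
        $seen[] = $v;
    }
}

// Each new value scans everything seen so far: 0 + 1 + ... + 99 comparisons.
echo $comparisons, "\n";
```

With n distinct values the loop does n(n-1)/2 comparisons, so the in_array() approach is quadratic in the worst case. With only a handful of distinct values the scanned array stays tiny and each lookup is cheap.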

As for the time complexity of array_unique():

  • first, a copy of the array will be created, which is an operation with linear characteristic
  • then, a C array of struct bucketindex will be created and pointers into our array's copy will be put into these buckets - linear characteristic again
  • then, the bucketindex-array will be sorted using quicksort - n log n on average
  • and lastly, the sorted array will be walked and duplicate entries will be removed from our array's copy - this should be linear again, assuming that deletion from our array is a constant time operation
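
The four steps above can be sketched in PHP roughly as follows. This is only an illustration of the copy / sort / walk idea, not a port of the C code: array_unique_sorted is a made-up name, the list of keys stands in for the bucketindex array, and the values are assumed to be scalars that can be compared as strings.

```php
<?php
// Hypothetical sort-based unique, following the steps listed above.
function array_unique_sorted(array $input): array
{
    $result = $input;                // step 1: work on a copy (linear)
    $keys = array_keys($input);      // step 2: one "bucket" per element (linear)
    $position = array_flip($keys);   // key => original position, used to break ties

    // Step 3: sort the buckets by value, n log n on average.
    usort($keys, function ($a, $b) use ($input, $position) {
        $cmp = strcmp((string) $input[$a], (string) $input[$b]);
        return $cmp !== 0 ? $cmp : ($position[$a] <=> $position[$b]);
    });

    // Step 4: equal values are now adjacent; keep the first of each run (linear).
    $previous = null;
    foreach ($keys as $i => $key) {
        if ($i > 0 && (string) $input[$key] === (string) $input[$previous]) {
            unset($result[$key]);
        } else {
            $previous = $key;
        }
    }

    return $result;
}

$unique = array_unique_sorted(['b', 'a', 'b', 'c', 'a']);
```

The sort dominates, so the whole thing is O(n log n) on average, which matches the analysis above.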

Hope this helps ;)

少钕鈤記 2024-07-19 03:39:52

Try this algorithm. It takes advantage of the fact that the key lookup is faster than in_array():

function array_unique_mine($A) {
    $keys = Array();
    $values = Array();
    foreach ($A as $k => $v) {
        if (!array_key_exists($v, $values)) {
            $keys[] = $k;
            $values[$v] = $v;
        }
    }
    return array_combine($keys, $values);
}

狼亦尘 2024-07-19 03:39:52

Gabriel's answer has some great points about why your friend's method beats yours. Intrigued by the conversation following Christoph's answer, I decided to run some tests of my own.

Also, I tried this with differing lengths of random strings and although the results were different, the order was the same. I used 6 chars in this example for brevity.

Notice that array_unique5 actually has the same keys as native, 2 and 3, but just outputs in a different order.
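
A small illustration of that difference in keys and order, using a five-element input chosen here just for the example (it is not part of the benchmark):

```php
<?php
$input = ['b', 'a', 'b', 'c', 'a'];

// Native: keeps the first key of each value, in original order.
$native = array_unique($input);
// [0 => 'b', 1 => 'a', 3 => 'c']

// Double flip (array_unique6): later duplicates overwrite the key, so the LAST key wins.
$flipped = array_flip(array_flip($input));
// [2 => 'b', 4 => 'a', 3 => 'c']

// Reverse first (array_unique5): the first key wins again, but the order follows the reversed input.
$reversed = array_flip(array_flip(array_reverse($input, true)));
// [1 => 'a', 3 => 'c', 0 => 'b']

var_dump($reversed == $native);  // same key/value pairs
var_dump($reversed === $native); // but not the same order
```

So if the order matters, a ksort() on the array_unique5 result should bring it in line with the native output, at the cost of an extra sort.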

Results...

Testing 10000 array items of data over 1000 iterations:
array_unique6:  1.7561039924622 array ( 9998 => 'b',    9992 => 'a',    9994 => 'f',    9997 => 'e',    9993 => 'c',    9999 => 'd',    )
array_unique4:  1.8798060417175 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   4 => 'c',   5 => 'd',   )
array_unique5:  7.5023629665375 array ( 10 => 'd',  0 => 'b',   3 => 'e',   2 => 'f',   9 => 'c',   1 => 'a',   )
array_unique3:  11.356487989426 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique:   22.535032987595 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique2:  62.107122898102 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique7:  71.557286024094 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )

And the code...

set_time_limit(0);
define('HASH_TIMES', 1000);

header('Content-Type: text/plain');

$aInput  = array();
for ($i = 0; $i < 10000; $i++) {
    array_push($aInput, chr(rand(97, 102)));
}

function array_unique2($a) {
    $n = array();
    foreach ($a as $k=>$v)
        if (!in_array($v,$n))
            $n[$k]=$v;
    return $n;
}

function array_unique3($aOriginal) {
    $aUnique = array();

    foreach ($aOriginal as $sKey => $sValue) {
        if (!isset($aUnique[$sValue])) {
            $aUnique[$sValue] = $sKey;
        }
    }

    return array_flip($aUnique);
}

function array_unique4($aOriginal) {
    return array_keys(array_flip($aOriginal));
}

function array_unique5($aOriginal) {
    return array_flip(array_flip(array_reverse($aOriginal, true)));
}

function array_unique6($aOriginal) {
    return array_flip(array_flip($aOriginal));
}

function array_unique7($A) {
    $keys = Array();
    $values = Array();
    foreach ($A as $k => $v) {
        if (!array_key_exists($v, $values)) {
            $keys[] = $k;
            $values[$v] = $v;
        }
    }
    return array_combine($keys, $values);
}

function showResults($sMethod, $fTime, $aInput) {
    echo $sMethod . ":\t" . $fTime . "\t" . implode("\t", array_map('trim', explode("\n", var_export(call_user_func($sMethod, $aInput), 1)))) . "\n";
}

echo 'Testing ' . (count($aInput)) . ' array items of data over ' . HASH_TIMES . " iterations:\n";

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique($aInput);
$aResults['array_unique'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique2($aInput);
$aResults['array_unique2'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique3($aInput);
$aResults['array_unique3'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique4($aInput);
$aResults['array_unique4'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique5($aInput);
$aResults['array_unique5'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique6($aInput);
$aResults['array_unique6'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique7($aInput);
$aResults['array_unique7'] = microtime(1) - $fTime;

asort($aResults, SORT_NUMERIC);
foreach ($aResults as $sMethod => $fTime) {
    showResults($sMethod, $fTime, $aInput);
}

Results using Christoph's data set from the comments:

$aInput = array(); for($i = 0; $i < 1000; ++$i) $aInput[$i] = $i; for($i = 500; $i < 700; ++$i) $aInput[10000 + $i] = $i;

Testing 1200 array items of data over 1000 iterations:
array_unique6:  0.83235597610474
array_unique4:  0.84050011634827
array_unique5:  1.1954448223114
array_unique3:  2.2937450408936
array_unique7:  8.4412341117859
array_unique:   15.225166797638
array_unique2:  48.685120105743

清旖 2024-07-19 03:39:52

PHP's arrays are implemented as hash tables, i.e. their performance characteristics are different from what you'd expect from 'real' arrays. An array's key-value-pairs are additionally stored in a linked list to allow fast iteration.

This explains why your implementation is so slow compared to your friend's: For every numeric index, your algorithm has to do a hash table lookup, whereas a foreach()-loop will just iterate over a linked list.

The following implementation uses a reverse hash table and might be the fastest of the crowd (double-flipping courtesy of joe_mucchiello):

function array_unique2($array) {
    return array_flip(array_flip($array));
}

This will only work if the values of $array are valid keys, i.e. integers or strings.
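
One way to guard against that restriction, as a sketch only: array_unique_flip_safe is a hypothetical wrapper, not something from this thread. It falls back to the built-in whenever a value is not an integer or a string, because array_flip() skips such values with a warning.

```php
<?php
// Hypothetical helper: only takes the double-flip shortcut when every value is a valid key.
function array_unique_flip_safe(array $array): array
{
    foreach ($array as $value) {
        if (!is_int($value) && !is_string($value)) {
            // Floats, booleans, null, arrays and objects cannot be flipped.
            return array_unique($array);
        }
    }

    return array_flip(array_flip($array));
}

$strings = array_unique_flip_safe(['a', 'b', 'a']); // takes the double-flip path
$floats  = array_unique_flip_safe([1.5, 1.5, 2]);   // falls back to array_unique()
```

Keep in mind that the double flip also turns numeric strings such as '5' into the integer 5, since that is how PHP normalises array keys. The extra type check costs one linear pass, so it does not change the overall complexity.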

I also reimplemented your algorithm using foreach()-loops. Now, it will actually be faster than your friend's for small data sets, but still slower than the solution via array_flip():

function array_unique3($array) {
    $unique_array = array();

    foreach($array as $current_key => $current_value) {
        foreach($unique_array as $old_value) {
            if($current_value === $old_value)
                continue 2;
        }
        $unique_array[$current_key] = $current_value;
    }

    return $unique_array;
}

For large data sets, the built-in version array_unique() will outperform all others except the double-flipping one. Also, the version using in_array() by your friend will be faster than array_unique3().

To summarize: Native code for the win!


Yet another version, which should preserve keys and their ordering:

function array_flop($array) {
    $flopped_array = array();

    foreach($array as $key => $value) {
        if(!isset($flopped_array[$value]))
            $flopped_array[$value] = $key;
    }

    return $flopped_array;
}

function array_unique4($array) {
    return array_flip(array_flop($array));
}

This is actually enobrev's array_unique3() - I didn't check his implementations as thoroughly as I should have...

陌伤浅笑 2024-07-19 03:39:52

PHP is slower to execute than raw machine code (which is most likely executed by array_unique).

Your second example function (the one your friend wrote) is interesting. I do not see how it would be faster than the native implementation, unless the native one is removing elements instead of building a new array.

百合的盛世恋 2024-07-19 03:39:52

I'll admit I don't understand the native code very well, but it seems to copy the entire array, sort it, then loop through it removing duplicates. In that case your second piece of code is actually a more efficient algorithm, since adding to the end of an array is cheaper than deleting from the middle of it.

Keep in mind the PHP developers probably had a good reason for doing it the way they do. Does anyone want to ask them?

鹿港小镇 2024-07-19 03:39:52

The native PHP function array_unique is implemented in C. Thus it is faster than PHP code, which has to be translated first. What's more, PHP uses a different algorithm than you do. As I see it, PHP first uses quicksort to sort the elements and then deletes the duplicates in one run.

Why is his friend's implementation faster than his own? Because it uses more built-in functionality instead of trying to recreate it.
