PHP 数组 - 删除重复项(时间复杂度)
好吧,这不是“如何获取所有唯一值”或“如何从 php 数组中删除重复项”的问题。 这是一个关于时间复杂度的问题。
我认为array_unique
有点O(n^2 - n),这是我的实现:
function array_unique2($array)
{
$to_return = array();
$current_index = 0;
for ( $i = 0 ; $i < count($array); $i++ )
{
$current_is_unique = true;
for ( $a = $i+1; $a < count($array); $a++ )
{
if ( $array[$i] == $array[$a] )
{
$current_is_unique = false;
break;
}
}
if ( $current_is_unique )
{
$to_return[$current_index] = $array[$i];
}
}
return $to_return;
}
但是,当针对array_unique
进行基准测试时,我得到了以下结果:
测试(array_unique2 )...操作花费了 0.52146291732788 秒。
测试(array_unique)...操作花费了 0.28323101997375 秒。
这使得 array_unique 的速度提高了一倍,我的问题是,为什么(两者都有相同的随机数据)?
我的一个朋友写了以下内容:
function array_unique2($a)
{
$n = array();
foreach ($a as $k=>$v)
if (!in_array($v,$n))
$n[$k]=$v;
return $n;
}
比 php 中内置的速度快一倍。
我想知道,为什么?
array_unique 和 in_array 的时间复杂度是多少?
编辑 我从两个循环中删除了 count($array),只在函数顶部使用了一个变量,这在 100 000 个元素上节省了 2 秒!
Okay this is not a question of "how to get all uniques" or "How to remove duplicates from my array in php". This is a question about the time complexity.
I figured that the array_unique
is somewhat O(n^2 - n) and here's my implementation:
function array_unique2($array)
{
$to_return = array();
$current_index = 0;
for ( $i = 0 ; $i < count($array); $i++ )
{
$current_is_unique = true;
for ( $a = $i+1; $a < count($array); $a++ )
{
if ( $array[$i] == $array[$a] )
{
$current_is_unique = false;
break;
}
}
if ( $current_is_unique )
{
$to_return[$current_index] = $array[$i];
}
}
return $to_return;
}
However when benchmarking this against the array_unique
i got the following result:
Testing (array_unique2)... Operation took 0.52146291732788 s.
Testing (array_unique)... Operation took 0.28323101997375 s.
Which makes the array_unique twice as fast, my question is, why ( Both had the same random data ) ?
And a friend of mine wrote the following:
function array_unique2($a)
{
$n = array();
foreach ($a as $k=>$v)
if (!in_array($v,$n))
$n[$k]=$v;
return $n;
}
which is twice as fast as the built in one in php.
I'd like to know, why?
What is the time-complexity of array_unique and in_array?
Edit
I removed the count($array) from both loops and just used a variable in the top of the function, that gained 2 seconds on 100 000 elements!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
虽然我不能代表原生 array_unique 函数,但我可以告诉您,您朋友的算法更快,因为:
虽然这些因素本身都不是很大,但我可以看到累积效应会使您的算法比您的朋友花费更长的时间。
While I can't speak for the native array_unique function, I can tell you that your friends algorithm is faster because:
While none of these factors alone is huge, I can see where the cumulative effect would make your algorithm take longer than your friends.
in_array()
的时间复杂度为 O(n)。 要了解这一点,我们将查看 PHP源代码。in_array()
函数在ext/standard/array.c
中实现。 它所做的只是调用php_search_array()
,其中包含以下循环:这就是线性特征的来源。
这是该算法的整体特征,因为
zend_hash_move_forward_ex()
具有恒定的行为:查看Zend/zend_hash.c
,我们看到它基本上只是时间复杂度
array_unique()
:struct bucketindex 将被创建,指向我们数组副本的指针将被放入这些桶中 - 再次线性特性,
bucketindex
- 数组将使用快速排序进行排序 - 平均而言,nlog
n希望这有帮助;)
The time complexity of
in_array()
is O(n). To see this, we'll take a look at the PHP source code.The
in_array()
function is implemented inext/standard/array.c
. All it does is callphp_search_array()
, which contains the following loop:That's where the linear characteristic comes from.
This is the overall characteristic of the algorithm, becaus
zend_hash_move_forward_ex()
has constant behaviour: Looking atZend/zend_hash.c
, we see that it's basically justAs for the time complexity of
array_unique()
:struct bucketindex
will be created and pointers into our array's copy will be put into these buckets - linear characteristic againbucketindex
-array will be sorted usign quicksort - nlog
n on averageHope this helps ;)
试试这个算法。 它利用了键查找比 in_array() 更快的事实:
Try this algorithm. It takes advantage of the fact that the key lookup is faster than in_array():
加布里埃尔的 答案有一些关于为什么你朋友的方法胜过你的方法的要点。 对Christoph之后的对话很感兴趣 答案,我决定自己运行一些测试。
另外,我尝试使用不同长度的随机字符串,虽然结果不同,但顺序是相同的。 为简洁起见,我在本示例中使用了 6 个字符。
请注意,array_unique5 实际上具有与 native、2 和 3 相同的键,但只是输出顺序不同。
结果...
和代码...
使用来自评论的 Christoph's 数据集的结果:
Gabriel's answer has some great points about why your friend's method beats yours. Intrigued by the conversation following Christoph's answer, I decided to run some tests of my own.
Also, I tried this with differing lengths of random strings and although the results were different, the order was the same. I used 6 chars in this example for brevity.
Notice that array_unique5 actually has the same keys as native, 2 and 3, but just outputs in a different order.
Results...
And The Code...
Results using Christoph's data set from the comments:
PHP 的数组是作为哈希表实现的,即它们的性能特征与您对“真实”数组的期望不同。 数组的键值对另外存储在链表中以允许快速迭代。
这解释了为什么您的实现比您朋友的实现如此慢:对于每个数字索引,您的算法必须执行哈希表查找,而
foreach()
循环只会迭代链接列表。以下实现使用反向哈希表,可能是最快的实现(双重翻转由 joe_mucchiello 提供):
只有当
$array
的值是有效的键,即整数或字符串。我还使用 foreach() 循环重新实现了您的算法。 现在,对于小数据集,它实际上比你朋友的更快,但仍然比通过 array_flip() 的解决方案慢:
对于大数据集,内置版本 array_unique()< /code> 将胜过除双翻转之外的所有其他方法。 此外,您的朋友使用
in_array()
的版本将比array_unique3()
更快。总结一下:本机代码获胜!
还有另一个版本,它应该保留键及其顺序:
这实际上是 enobrev 的
array_unique3()
- 我没有像我应该的那样彻底检查他的实现。 ..PHP's arrays are implemented as hash tables, i.e. their performance characteristics are different from what you'd expect from 'real' arrays. An array's key-value-pairs are additionally stored in a linked list to allow fast iteration.
This explains why your implementation is so slow compared to your friend's: For every numeric index, your algorithm has to do a hash table lookup, whereas a
foreach()
-loop will just iterate over a linked list.The following implementation uses a reverse hash table and might be the fastest of the crowd (double-flipping courtesy of joe_mucchiello):
This will only work if the values of
$array
are valid keys, ie integers or strings.I also reimplemented your algorithm using
foreach()
-loops. Now, it will actually be faster than your friend's for small data sets, but still slower than the solution viaarray_flip()
:For large data sets, the built-in version
array_unique()
will outperform all other's except the double-flipping one. Also, the version usingin_array()
by your friend will be faster thanarray_unique3()
.To summarize: Native code for the win!
Yet another version, which should preserve keys and their ordering:
This is actually enobrev's
array_unique3()
- I didn't check his implementations as thoroughly as I should have...PHP 的执行速度比原始机器代码(很可能由 array_unique 执行)慢。
你的第二个示例函数(你朋友写的那个)很有趣。 我不知道它如何比本机实现更快,除非本机实现是删除元素而不是构建新数组。
PHP is slower to execute than raw machine code (which is most likely executed by
array_unique
).Your second example function (the one your friend wrote) is interesting. I do not see how it would be faster than the native implementation, unless the native one is removing elements instead of building a new array.
我承认我不太了解本机代码,但它似乎复制整个数组,对其进行排序,然后循环遍历它以删除重复项。 在这种情况下,您的第二段代码实际上是一种更有效的算法,因为添加到数组的末尾比从中间删除更便宜。
请记住,PHP 开发人员这样做可能有充分的理由。 有人想问他们吗?
I'll admit I don't understand the native code very well, but it seems to copy the entire array, sort it, then loop through it removing duplicates. In that case your second piece of code is actually a more efficient algorithm, since adding to the end of an array is cheaper than deleting from the middle of it.
Keep in mind the PHP developers probably had a good reason for doing it the way they do. Does anyone want to ask them?
本机 PHP 函数
array_unique
为 用 C 实现。 因此它比必须先翻译的 PHP 更快。 而且,PHP 使用与您不同的算法。 据我所知,PHP 首先使用 快速排序 对元素进行排序,然后删除重复项一跑。为什么他朋友的实现速度比他自己的快? 因为它使用了更多的内置功能来尝试重新创建它们。
The native PHP function
array_unique
is implemented in C. Thus it is faster than PHP, that has to be translated first. What’s more, PHP uses an different algorithm than you do. As I see it, PHP first uses Quick sort to sort the elements and then deletes the duplicates in one run.Why his friend’s implementation is faster has his own? Because it uses more built-in functionality that trying to recreate them.