哪个更快?对一小部分元素进行排序或相乘?
阅读Cactus Kev 的扑克牌评估器,我注意到以下陈述:
首先,我我认为我总是可以先简单地对手牌进行排序,然后再将其传递给评估员;但排序需要时间,我不想浪费任何 CPU 周期来排序。我需要一种不关心五张牌的顺序的方法。
...
经过深思熟虑,我灵机一动,决定使用素数。我会为十三张卡牌中的每一张分配一个素数值...这个系统的美妙之处在于,如果您将手中每张卡牌的素数相乘,您就会得到一个独特的乘积,无论顺序如何五张牌中。
...
由于乘法是计算机可以进行的最快计算之一,如果我们被迫在评估之前对每手牌进行排序,我们就可以节省数百毫秒的时间。
我很难相信这一点。
Cactus Kev 将每张牌表示为 4 字节整数,并通过调用 eval_5cards( int c1, int c2, int c3, int c4, int c5 ) 来评估手牌。我们可以将牌表示为一个字节,将一手扑克牌表示为 5 字节数组。对这个 5 字节数组进行排序以获得唯一的手牌一定非常快。比他的手段还快吗?
如果我们保留他的表示形式(卡片为 4 字节整数)会怎么样?对 5 个整数的数组进行排序是否比它们相乘更快?如果不是,可以进行哪种低级优化来更快地对少量元素进行排序?
谢谢!
大家的回答都很好;我正在研究排序与乘法的性能基准测试,以获得一些硬性能统计数据。
Reading through Cactus Kev's Poker Hand Evaluator, I noticed the following statements:
At first, I thought that I could always simply sort the hand first before passing it to the evaluator; but sorting takes time, and I didn't want to waste any CPU cycles sorting hands. I needed a method that didn't care what order the five cards were given as.
...
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
...
Since multiplication is one of the fastest calculations a computer can make, we have shaved hundreds of milliseconds off our time had we been forced to sort each hand before evaluation.
I have a hard time believing this.
Cactus Kev represents each card as a 4-byte integer, and evaluates hands by calling eval_5cards( int c1, int c2, int c3, int c4, int c5 )
. We could represent cards as one byte, and a poker hand as a 5-byte array. Sorting this 5-byte array to get a unique hand must be pretty fast. Is it faster than his approach?
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them? If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
Thanks!
Good answers everyone; I'm working on benchmarking the performance of sorting vs multiplication, to get some hard performance statistics.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
当然,这在很大程度上取决于计算机的CPU,但是典型的Intel CPU(例如Core 2 Duo)可以在3个CPU时钟周期内将两个32位数字相乘。对于排序算法来说,要克服这个问题,该算法需要比 3 * 4 = 12 个 CPU 周期更快,这是一个非常严格的约束。任何标准排序算法都无法在少于 12 个周期内完成。仅两个数字的比较将花费 1 个 CPU 周期,结果的条件分支也将花费 1 个 CPU 周期,无论您做什么,都将至少花费 1 个 CPU 周期(交换两张卡实际上至少需要 4 个 CPU 周期)。所以乘法获胜。
当然,这并没有考虑从一级或二级缓存甚至内存中获取卡值的延迟;然而,这种延迟适用于乘法和排序这两种情况。
Of course it depends a lot on the CPU of your computer, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32 Bit numbers within 3 CPU clock cycles. For a sort algorithm to beat that, the algorithm needs to be faster than 3 * 4 = 12 CPU cycles, which is a very tight constraint. None of the standard sorting algorithms can do it in less than 12 cycles for sure. Alone the comparison of two numbers will take one CPU cycle, the conditional branch on the result will also take one CPU cycle and whatever you do then will at least take one CPU cycle (swapping two cards will actually take at least 4 CPU cycles). So multiplying wins.
Of course this is not taking the latency into account to fetch the card value from either 1st or 2nd level cache or maybe even memory; however, this latency applies to either case, multiplying and sorting.
未经测试,我同意他的论点。与排序相比,您可以通过 4 次乘法来完成,排序是
n log n
。具体来说,最佳的排序网络需要9次比较。然后求值器必须至少查看排序数组的每个元素,这又是 5 个操作。Without testing, I'm sympathetic to his argument. You can do it in 4 multiplications, as compared to sorting, which is
n log n
. Specifically, the optimal sorting network requires 9 comparisons. The evaluator then has to at least look at every element of the sorted array, which is another 5 operations.排序本质上并不比数字相乘困难。从理论上讲,它们大致相同,并且您还需要复杂的乘法算法来使大型乘法与大型排序竞争。而且,当所提出的乘法算法可行时,还可以使用桶排序,其渐近速度更快。
然而,一手扑克牌并不是渐近问题。只有 5 张牌,他只关心牌的 13 个数字值之一。即使乘法在原理上很复杂,但实际上它是用微代码实现的,而且速度非常快。他正在做的事情有效。
现在,如果您对理论问题感兴趣,还有一个使用加法而不是乘法的解决方案。任一值只能有 4 张卡,因此您也可以指定值 1,5,25,...,5^12 并将它们相加。它仍然适合 32 位算术。还有其他具有其他数学属性的基于加法的解决方案。但这并不重要,因为微编码算术比计算机正在执行的任何其他操作都要快得多。
Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.
However, a poker hand is not an asymptotic problem. It's just 5 cards and he only cares about one of the 13 number values of the card. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.
Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
使用优化的决策树可以对 5 个元素进行排序,这比使用通用排序算法要快得多。
然而,事实仍然是排序意味着很多分支(之后必要的比较也是如此)。对于现代流水线 CPU 架构来说,分支确实很糟糕,尤其是具有相似可能性的任一方向的分支(从而破坏了分支预测逻辑)。这比乘法与比较的理论成本要高得多,使得乘法速度更快。
但如果您可以构建自定义硬件来进行排序,它可能会更快。
5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.
However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.
But if you could build custom hardware to do the sorting, it might end up faster.
这不应该真正相关,但他是对的。排序比乘法花费的时间要长得多。
真正的问题是他对得到的质数做了什么,以及这有什么帮助(因为考虑到它,我预计比排序需要更长的时间。
That shouldn't really be relevant, but he is correct. Sorting takes much longer than multiplying.
The real question is what he did with the resulting prime number, and how that was helpful (since factoring it I would expect to take longer than sorting.
很难想象还有比同一组数字相乘更快的排序操作。在处理器级别,乘法只是加载,加载,乘法,加载,乘法,...,可能还需要对累加器进行一些操作。它是线性的,易于流水线化,与相关的分支错误预测成本。每个要相乘的值平均应有大约 2 个指令。除非乘法指令非常慢,否则很难想象更快的排序。
It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just
load, load, multiply, load, multiply, ...
, with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, no comparisons with the associated branch mis-prediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a faster sort.值得一提的是,即使您的 CPU 的乘法指令非常慢(或根本不存在......),您也可以使用查找表来进一步加快速度。
One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...) you can use a lookup table to speed things even further.
这是非位置数字系统的一个例子。
我找不到该理论的链接。我将其作为应用代数的一部分进行了研究,围绕欧拉方程和加密进行。 (我可能会用错术语,因为我已经用我的母语研究了所有这些内容。)
RAM 是一种外部资源,通常比 CPU 慢。由于交换操作,对 5 个整数进行排序总是必须进入 RAM。再加上排序函数本身的开销,乘法看起来就不再那么糟糕了。
我认为在现代 CPU 上,整数乘法几乎总是比排序快,因为可以在不同的 ALU 上同时执行多个乘法,而只有一条总线将 CPU 连接到 RAM。
使用 冒泡排序可以快速对 5 个整数进行排序:qsort 会使用更多内存(用于递归)而优化良好的冒泡排序则完全可以从 d-cache 运行。
That's a example of a non-positional number system.
I can't find the link to the theory. I studied that as part of applied algebra, somewhere around the Euler's totient and encryption. (I can be wrong with terminology as I have studied all that in my native language.)
RAM is an external resource and is generally slower compared to the CPU. Sorting 5 of ints would always have to go to RAM due to swap operations. Add here the overhead of sorting function itself, and multiplication stops looking all that bad.
I think on modern CPUs integer multiplication would pretty much always faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting CPU to RAM.
5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion) while well optimized bubble sort would work completely from d-cache.
正如其他人指出的那样,单独排序并不比乘以 5 个值更快。然而,这忽略了他解决方案的其余部分。在鄙视 5 元素排序之后,他继续对 4888 个值的数组进行二分搜索 - 至少 12 次比较,比以往所需的排序次数还要多!
请注意,我并不是说有更好的涉及排序的解决方案 - 我个人没有给予足够的思考 - 只是排序本身只是问题的一部分。
他也不必使用素数。如果他简单地将每张牌的值编码为 4 位,则需要 20 位来表示一手牌,给出的范围为 0 到 2^20 = 1048576,大约是使用素数产生的范围的 1/100,并且足够小(尽管仍然存在缓存一致性问题)来生成一个查找表。当然,一个更有趣的变体是使用 7 张牌(例如德州扑克等游戏中的牌),并找到最好的 5 张牌手牌可以用它们制成。
As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!
Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.
He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving a range of 0 to 2^20 = 1048576, about 1/100th of the range produced using primes, and small enough (though still suffering cache coherency issues) to produce a lookup table over.Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.
乘法速度更快。
任何给定数组的乘法总是比对数组进行排序更快,假设乘法会产生有意义的结果,并且查找表是无关紧要的,因为代码旨在评估扑克手牌,因此您需要对无论如何排序集。
The multiplication is faster.
Multiplication of any given array will always be faster than sorting the array, presuming the multiplication results in a meaningful result, and the lookup table is irrelevant because the code is designed to evaluate a poker hand so you'd need to do a lookup on the sorted set anyway.
您可以在此处 附有文档并在此处进一步解释。欢迎通过其中的电子邮件地址提供所有反馈。
您不需要排序,并且在评估 7 张牌时,通常(约 97% 的时间)只需进行 6 次加法和几次移位即可。该算法使用生成的查找表,该表占用约 9MB 的 RAM,并且几乎是在瞬间生成的。便宜的。所有这些都是在 32 位内部完成的,“内联”7 张牌评估器非常适合在我的笔记本电脑上评估每秒约 50m 随机生成的手牌。
哦,乘法比排序更快。
An example of a ready made Texas Hold'em 7- and 5-card evaluator can be found here with documentation and further explained here. All feedback welcome at the e-mail address found therein.
You don't need to sort, and can typically (~97% of the time) get away with just 6 additions and a couple of bit shifts when evaluating 7-card hands. The algo uses a generated look up table which occupies about 9MB of RAM and is generated in a near-instant. Cheap. All of this is done inside of 32-bits, and "inlining" the 7-card evaluator is good for evaluating about 50m randomly generated hands per second on my laptop.
Oh, and multiplication is faster than sorting.