Algorithms for modern hardware?
Once again, I find myself with a set of broken assumptions. The article itself is about a 10x performance gain by modifying a proven-optimal algorithm to account for virtual memory:
On a modern multi-issue CPU, running at some gigahertz clock frequency, the worst-case loss is almost 10 million instructions per VM page fault. If you are running with a rotating disk, the number is more like 100 million instructions.

What good is an O(log2(n)) algorithm if those operations cause page faults and slow disk operations? For most relevant datasets an O(n) or even an O(n^2) algorithm, which avoids page faults, will run circles around it.
Are there more such algorithms around? Should we re-examine all those fundamental building blocks of our education? What else do I need to watch out for when writing my own?
Clarification:
The algorithm in question isn't faster than the proven-optimal one because the Big-O notation is flawed or meaningless. It's faster because the proven-optimal algorithm relies on an assumption that is not true in modern hardware/OSes, namely that all memory access is equal and interchangeable.
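The broken assumption (that every memory access costs the same) is easy to see in a quick experiment. Below is a minimal C++ sketch, with the array size and the shuffled-index trick chosen purely for illustration (they are not from the question), that touches the same number of array elements sequentially and then in random order; on data much larger than the caches and TLB, the random order is dramatically slower even though the operation count is identical:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Same number of memory accesses, very different cost: sequential access is
// prefetch/cache/TLB friendly, random access is not. (Illustrative sketch;
// the size is an assumption, pick something larger than your caches.)
int main() {
    const std::size_t n = 1u << 24;             // ~16M ints, roughly 64 MB of data
    std::vector<int> data(n, 1);

    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);   // 0, 1, 2, ... (sequential order)

    auto time_sum = [&](const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (std::size_t i : order) sum += data[i];
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                           std::chrono::steady_clock::now() - t0).count();
        std::printf("%-10s sum=%lld  %lld ms\n", name, sum, ms);
    };

    time_sum("sequential");
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    time_sum("random");                         // same accesses, shuffled order
}
```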
Comments (5)
You only need to re-examine your algorithms when your customers complain about the slowness of your program or it is missing critical deadlines. Otherwise focus on correctness, robustness, readability, and ease of maintenance. Until these items are achieved any performance optimization is a waste of development time.
Page faults and disk operations may be platform specific. Always profile your code to see where the bottlenecks are. Spending time on these areas will produce the most benefits.
If you're interested, along with page faults and slow disk operations, you may also want to be aware of:
- branches / jumps (see the sketch after this answer);
- whether your data and code fit into the processor's cache.
Again, these items only matter after quality has been achieved, customers have complained, and a profiler has analyzed your program.
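To illustrate the branches/jumps point, here is a minimal C++ sketch (the data size and the sorted-vs-unsorted comparison are my own illustrative choices, not from this answer): the same loop over the same values typically runs noticeably faster once the data is sorted, because the branch becomes predictable. Depending on compiler optimizations the effect may shrink or disappear, so treat it as a sketch, not a benchmark.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Illustrative sketch of the "branches / jumps" point: the same loop over the
// same values, first with an unpredictable branch, then with a predictable one.
int main() {
    const std::size_t n = 1u << 24;
    std::vector<int> v(n);
    std::mt19937 rng{123};
    for (int& x : v) x = static_cast<int>(rng() % 256);

    auto count_large = [&](const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        long long count = 0;
        for (int x : v)
            if (x >= 128) ++count;              // branch taken ~50% of the time
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                           std::chrono::steady_clock::now() - t0).count();
        std::printf("%-8s count=%lld  %lld ms\n", name, count, ms);
    };

    count_large("random");                      // branch is hard to predict
    std::sort(v.begin(), v.end());
    count_large("sorted");                      // branch becomes predictable
}
```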
One important thing is to realize that the most common usage of big-O notation (to talk about runtime complexity) is only half of the story - there's another half, namely space complexity (which can also be expressed in big-O), and it can be just as relevant.
Generally these days, memory capacity advances have outpaced computing speed advances (for a single core - parallelization can get around this), so less attention is paid to space complexity, but it's still a factor that should be kept in mind, especially on machines with more limited memory or when working with very large amounts of data.
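As a small illustration of that time/space trade-off (the duplicate-detection example is mine, not the answerer's): the same question can be answered in O(n^2) time with O(1) extra space, or in O(n) expected time at the cost of O(n) extra space.

```cpp
#include <cstdio>
#include <unordered_set>
#include <vector>

// Two ways to answer "does this vector contain a duplicate?" with different
// time/space profiles (illustrative example only).

// O(n^2) time, O(1) extra space.
bool has_duplicate_quadratic(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] == v[j]) return true;
    return false;
}

// O(n) expected time, O(n) extra space for the hash set.
bool has_duplicate_linear(const std::vector<int>& v) {
    std::unordered_set<int> seen;
    for (int x : v)
        if (!seen.insert(x).second) return true;
    return false;
}

int main() {
    std::vector<int> v{3, 1, 4, 1, 5};
    std::printf("%d %d\n", has_duplicate_quadratic(v), has_duplicate_linear(v));
}
```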
I'll expand on GregS's answer: the difference is between effective complexity and asymptotic complexity. Asymptotic complexity ignores constant factors and is valid only for “large enough” inputs. Oftentimes “large enough” can actually mean “larger than any computer can deal with, now and for a few decades”; this is where the theory (justifiably) gets a bad reputation. Of course there are also cases where “large enough” means n=3!
A more complex (and thus more accurate) way of looking at this is to first ask “what is the size range of problems you are interested in?” Then you need to measure the efficiency of various algorithms in that size range, to get a feeling for the ‘hidden constants’. Or you can use finer methods of algorithmic asymptotics which actually give you estimates on the constants.
The other thing to look at is ‘transition points’. Of course an algorithm which runs in 2n^2 time will be faster than one that runs in 10^16 * n * log(n) time, for all n < 1.99 * 10^17. So the quadratic algorithm will be the one to choose (unless you are dealing with the sizes of data that CERN worries about). Even sub-exponential terms can bite - 3n^3 is a lot better than n^3 + 10^16 * n^2 for n < 5 * 10^15 (assuming that these are actual complexities).
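If you want to find such a transition point numerically rather than algebraically, a few lines of code will do it. The sketch below assumes the two cost functions quoted above (2n^2 and 10^16 * n * log(n), with a natural log) are the actual costs, and bisects for the crossover:

```cpp
#include <cmath>
#include <cstdio>

// Illustrative sketch: locate the crossover between 2*n^2 and 1e16*n*log(n).
double quadratic(double n) { return 2.0 * n * n; }
double nlogn(double n)     { return 1e16 * n * std::log(n); }

int main() {
    // Double an upper bound until the quadratic cost has overtaken n log n.
    double hi = 4.0;
    while (quadratic(hi) < nlogn(hi)) hi *= 2.0;

    // Bisect between hi/2 and hi for the transition point.
    double lo = hi / 2.0;
    for (int i = 0; i < 200; ++i) {
        double mid = 0.5 * (lo + hi);
        if (quadratic(mid) < nlogn(mid)) lo = mid; else hi = mid;
    }
    std::printf("crossover near n = %.3g\n", lo);  // about 1.99e17 with natural log
}
```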
There are no broken assumptions that I see. Big-O notation is a measure of algorithmic complexity on a very, very simplified, idealized computing machine, and it ignores constant terms. Obviously it is not the final word on actual speeds on actual machines.
O(n) is only part of the story -- a big part, and frequently the dominant part, but not always the dominant part. Once you get to performance optimization (which should not be done too early in your development), you need to consider all of the resources you are using. You can generalize Amdahl's Law to mean that your execution time will be dominated by the most limited resource. Note that this also means that the particular hardware on which you're executing must be considered. A program that is highly optimized and extremely efficient for a massively parallel computer (e.g., CM or MassPar) would probably not do well on a big vector box (e.g., Cray-2) or on a high-speed microprocessor. That program might not even do well on a massive array of capable microprocessors (map/reduce style). The different optimizations for the different balances of cache, CPU communications, I/O, CPU speed, memory access, etc. mean different performance.
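To make that generalized-Amdahl point concrete, here is a tiny sketch with made-up numbers (they are not from this answer): if 90% of the time is spent waiting on I/O, even an 8x speedup of the CPU-bound part barely changes the total.

```cpp
#include <cstdio>

// Generalized Amdahl-style estimate (illustrative numbers): total time is
// dominated by the slowest / most limited resource.
int main() {
    double cpu_s   = 10.0;   // assumed time spent in CPU-bound work
    double io_s    = 90.0;   // assumed time spent waiting on I/O
    double speedup = 8.0;    // suppose the CPU-bound part is made 8x faster

    double before = cpu_s + io_s;
    double after  = cpu_s / speedup + io_s;
    std::printf("before: %.1fs  after: %.2fs  overall speedup: %.2fx\n",
                before, after, before / after);  // ~1.10x despite the 8x CPU gain
}
```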
Back when I spent time working on performance optimizations, we would strive for "balanced" performance over the whole system. A super-fast CPU with a slow I/O system rarely made sense, and so on. O() typically considers only CPU complexity. You may be able to trade off memory space (unrolling loops doesn't make O() sense, but it does frequently help real performance); worry about cache hits versus linear memory layouts versus memory bank hits; virtual vs. real memory; tape vs. rotating disk vs. RAID, and so on. If your performance is dominated by CPU activity, with I/O and memory loafing along, big-O is your primary concern. If your CPU is at 5% and the network is at 100%, maybe you can get away from big-O and work on I/O, caching, etc.
Multi-threading, particularly with multi-cores, makes all of that analysis even more complex. This opens up into very extensive discussion. If you are interested, Google can give you months or years of references.