Can the big-O of an algorithm be found programmatically by analyzing its performance?
Note that I don't have a "problem" and I'm not looking for "another way to find the big O of my algorithm".
What I'd like to know is whether it would be possible to write a program to which you'd pass data points that are performance measurements of an algorithm for various input sizes, (n, time taken to solve the problem for n), and that would then determine the complexity of your algorithm.
For example, here's what the input could be (it could be much larger; it's really just an example, that's not the point of the question):
36 000 took 16 ms
109 000 took 21 ms
327 000 took 68 ms
984 000 took 224 ms
2 952 000 took 760 ms
8 857 000 took 2305 ms
26 571 000 took 7379 ms
79 716 000 took 23336 ms
Using this kind of data, is it possible to write a program that would tell whether we have, say, an O(n), O(log n), O(n log n), or O(n!) algorithm?
What you are looking for is curve fitting. All the simple algorithms for this problem that I know of will try to fit the data points to some sort of polynomial, but I suspect there are also some that can distinguish between polynomials and non-polynomials.
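Here's a minimal sketch of that idea in Python (the candidate set, the t ≈ c·f(n) model, and the relative-error scoring are assumptions made for illustration, not a standard recipe), using the measurements from the question:

```python
import numpy as np

# The (n, milliseconds) measurements from the question.
samples = [
    (36_000, 16), (109_000, 21), (327_000, 68), (984_000, 224),
    (2_952_000, 760), (8_857_000, 2_305), (26_571_000, 7_379),
    (79_716_000, 23_336),
]
n = np.array([s[0] for s in samples], dtype=float)
t = np.array([s[1] for s in samples], dtype=float)

# Candidate growth functions f(n); for each we fit t ~ c * f(n).
candidates = {
    "O(log n)":   np.log(n),
    "O(n)":       n,
    "O(n log n)": n * np.log(n),
    "O(n^2)":     n ** 2,
}

best_name, best_err = None, float("inf")
for name, f in candidates.items():
    c = (f @ t) / (f @ f)          # least-squares constant for t ~ c * f(n)
    err = np.sqrt(np.mean(((c * f - t) / t) ** 2))  # relative RMS error
    print(f"{name:11s} relative RMS error = {err:.3f}")
    if err < best_err:
        best_name, best_err = name, err

print("best fit:", best_name)
```

Note that this can only choose among the candidates you give it; it can't discover, say, O(n!) unless that function is in the list.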
You can use curve fitting (see @Max S.) to determine the formula that describes your data. However, this is only half of the story, as there is no way to know whether the data describes your algorithm to its full extent.
For instance, your algorithm may present linear behavior for n < 1,000,000,000 and then start behaving quadratically. If you don't have data points where n > 1,000,000,000, then your analysis program will not be able to give you a correct answer.
Thus, to conclude: you can do it programmatically, but the results will be limited to the data points in your sample, and there is no algorithmic way to determine whether the sample sufficiently covers all the "interesting" points.
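To illustrate the extrapolation hazard, here is a sketch with an invented cost model (the threshold and both regimes are made up for the example): every sample taken below the threshold looks perfectly linear, and only samples beyond it reveal the quadratic regime.

```python
# Invented cost model: linear below a threshold, quadratic above it.
THRESHOLD = 1_000_000_000

def cost(n):
    """Hypothetical running time, for illustration only."""
    if n < THRESHOLD:
        return n                    # linear regime
    return n * n // THRESHOLD       # quadratic regime

print("samples below the threshold (look O(n)):")
for n in (10**6, 10**7, 10**8):
    print(f"  n={n:>13,}  cost={cost(n):>16,}")

print("samples above the threshold (reveal O(n^2)):")
for n in (10**9, 10**10, 10**11):
    print(f"  n={n:>13,}  cost={cost(n):>16,}")
```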
If you're trying to estimate big-O empirically, you must be very careful to ensure that you're testing on a wide range of instances at each size. Remember that big-O is a worst-case notion. It is not uncommon to find algorithms that perform well in all but a few pathological cases, but it is exactly those pathological cases that determine the big-O time. That is, if you miss the pathological cases in your sampling, you could come away with the idea that an O(2^n) algorithm is O(n) instead.
If what you really need is the big-O time, and not just an idea of average performance, then I recommend proving it analytically. Without doing that, you cannot be certain that you haven't missed some pathological input.
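As a concrete sketch of this point (the deliberately naive quicksort below is an invented example): timed on random inputs it looks close to O(n log n), while already-sorted inputs, the pathological case for a first-element pivot, expose the O(n^2) worst case.

```python
import random
import sys
import time

sys.setrecursionlimit(10_000)  # the sorted case recurses ~n levels deep

def quicksort(a):
    """Naive quicksort with the first element as pivot:
    ~n log n on random input, n^2 on already-sorted input."""
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))

for n in (500, 1_000, 2_000, 4_000):
    cases = {
        "random": random.sample(range(n), n),
        "sorted": list(range(n)),   # pathological for this pivot choice
    }
    for label, data in cases.items():
        start = time.perf_counter()
        quicksort(data)
        ms = (time.perf_counter() - start) * 1000
        print(f"n={n:5d}  {label}  {ms:8.2f} ms")
```

A benchmark that only ever samples random inputs would report the near n log n trend and miss the quadratic worst case entirely.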
I think you could approximate it using regression, but you won't get exact results. That's because most algorithms have different performance depending on what the input is (not just its size). So to understand this fully, you would need the source.
Most big-O analysis assumes an idealized machine with infinite memory and uniform access times, no influence from other applications, and so on. In particular, crossing thresholds such as the cache size or the main-memory size (paging in and out of the swap file) can have a significant influence on performance. So what you determine is how the algorithm performs in the real world, not its idealized runtime.
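A quick way to see such a threshold (a machine-dependent sketch: the sizes and the sharpness of the jump depend entirely on your hardware, and sequential scans are prefetch-friendly, so the effect may be modest) is to time the same per-element work over working sets of growing size:

```python
import time
import numpy as np

# Touch roughly the same number of elements for every working-set size;
# per-element cost tends to rise as the array outgrows each cache level,
# even though the scan itself is O(n) throughout.
TOUCHES = 20_000_000
for size in (2**14, 2**17, 2**20, 2**23, 2**25):  # 128 KiB .. 256 MiB of float64
    a = np.random.rand(size)
    reps = max(1, TOUCHES // size)
    start = time.perf_counter()
    for _ in range(reps):
        a.sum()                     # one sequential pass over the array
    elapsed = time.perf_counter() - start
    print(f"{size * 8 // 1024:>7} KiB  {elapsed / (reps * size) * 1e9:6.3f} ns/element")
```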