How do I compare the running times of two data structures?
I want to compare the performance of two search trees of integers (an AVL tree vs. a red-black tree). How should I design the tests to accomplish this? For instance, consider the insert operation: what steps should I follow in order to state that, on average, this operation is faster for the red-black tree? Should I time inserting just one element (assuming the trees are pre-populated), or should I time a sequence of insertions? Also, what should I take into account to measure CPU time accurately?
Thanks in advance.
2 Answers
This is a really broad question, and as such, I don't think you should be hoping for anybody to get on here and give you the one final correct answer regarding how to measure performance. That being said...
First, you should develop a suite of tests. Two popular techniques exist for doing this: monitor a real-world sequence of operations done by an application (so, find some open source application that uses either an AVL or RB tree, and add some code to print out the sequence of operations it performs) or create such a stream of operations analytically (or synthetically) to target any number of cases (the average usage, particular kinds of abnormal or otherwise unusual usage, random usage, etc.). The more of these kinds of traces you get to test, the better.
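For example, a synthetic trace can be as simple as a list of (operation, key) pairs drawn from whatever distribution you want to target. The sketch below shows one way to do this; the Op/Trace types and the 50/30/20 insert/find/erase mix are illustrative assumptions, not part of the original answer.

```cpp
// A minimal sketch of generating a synthetic trace of tree operations.
#include <cstdint>
#include <random>
#include <vector>

enum class OpKind { Insert, Find, Erase };

struct Op {
    OpKind kind;
    int    key;
};

using Trace = std::vector<Op>;

// Generate n operations: mostly inserts, some lookups and deletions,
// drawing keys from a bounded range so collisions and re-inserts occur.
Trace make_random_trace(std::size_t n, int key_range, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<int> key(0, key_range - 1);
    std::uniform_int_distribution<int> pct(0, 99);

    Trace trace;
    trace.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        int p = pct(rng);
        OpKind kind = p < 50 ? OpKind::Insert
                    : p < 80 ? OpKind::Find
                             : OpKind::Erase;
        trace.push_back({kind, key(rng)});
    }
    return trace;
}
```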
Once you have your set of traces to test, you need to develop a driver to do the evaluation. The driver should be simple, the same for both AVL and RB trees (I think that in this case, this shouldn't be a problem; both present the same interface to users, differing only in terms of internal implementation details). The driver should be able to reproduce the usage recorded in your trace sets efficiently and cause the traced operations to be carried out on your data structures. One thing I like to do is to include a third "dummy" candidate that does nothing; this way, I can see how much of an influence the processing of traces is exerting on overall performance.
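A driver along these lines might look like the sketch below. It reuses the Trace/Op types from the previous sketch and assumes both trees expose hypothetical insert/contains/erase members; DummyTree is the do-nothing third candidate.

```cpp
// A sketch of a trace-replay driver. AvlTree and RbTree (not shown) are
// assumed to provide the same insert/contains/erase interface.
template <typename Tree>
void replay(const Trace& trace, Tree& tree) {
    for (const Op& op : trace) {
        switch (op.kind) {
            case OpKind::Insert: tree.insert(op.key);         break;
            case OpKind::Find:   (void)tree.contains(op.key); break;
            case OpKind::Erase:  tree.erase(op.key);          break;
        }
    }
}

// The "dummy" candidate: same interface, does nothing. Timing it shows how
// much of the measured cost is trace processing rather than tree work.
struct DummyTree {
    void insert(int)         {}
    bool contains(int) const { return false; }
    void erase(int)          {}
};
```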
Each trace should be executed many, many times. You can formalize this somewhat (to reduce statistical uncertainty to within known bounds), but a rule of thumb is that the order of your error will shrink according to 1/sqrt(n), where n is the number of trials. In other words, by running each trace 10,000 times instead of 100 times, you will get errors in the average that are 10x smaller. Record all values; things to look for are the mean, median, mode(s), etc. For each run, try to keep the system conditions the same; no other programs running, etc. To help eliminate spurious results due to external factors changing, you can cull the bottom and top 10% of outliers...
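As a sketch of the bookkeeping, something like the following computes the mean, median, standard deviation, and a 10%-trimmed mean over the recorded timings; the struct and function names are made up for illustration.

```cpp
// A sketch of summarizing recorded timings, culling the bottom and top 10%
// of samples for the trimmed mean as suggested above.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

struct Summary {
    double mean, median, stddev, trimmed_mean;
};

Summary summarize(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();

    double mean = std::accumulate(samples.begin(), samples.end(), 0.0) / n;
    double median = n % 2 ? samples[n / 2]
                          : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);

    double sq = 0.0;
    for (double x : samples) sq += (x - mean) * (x - mean);
    double stddev = std::sqrt(sq / n);

    // Drop the lowest and highest 10% before averaging.
    std::size_t cut = n / 10;
    double trimmed = std::accumulate(samples.begin() + cut,
                                     samples.end() - cut, 0.0) /
                     (n - 2 * cut);

    return {mean, median, stddev, trimmed};
}
```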
Now, simply compare the data sets. Perhaps what you care most about is the average time the trace takes? Perhaps the worst? Maybe what you really care about is consistency; is the standard deviation big or small? You should have enough data to compare the results for a given trace executed on both test structures; and for different traces, it might make more sense to look at different figures (for instance, if you created a synthetic benchmark that should be the worst case for RB trees, you might ask how badly RB and AVL trees did, whereas you might not care about this for another trace representing the best case for AVL trees, etc.)
Timing on the CPU can be a challenge in its own right. You'll need to ensure that the resolution of your timer is sufficient for measuring your events. The clock() and gettimeofday() functions - and others - are popular choices for recording the time of events. If your traces finish too quickly, you can take the aggregate time over several trials (so that if your timer supports microsecond resolution and your trace finishes in 10 microseconds, you can measure 100 executions of the trace instead of 1 and get a total on the order of a millisecond, which should be accurate).
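Here is a sketch of that aggregation, reusing the replay() driver from above. It uses std::chrono::steady_clock rather than clock() or gettimeofday(); the choice of clock is an assumption, not a requirement.

```cpp
// A sketch of timing a trace by aggregating many executions, so the total
// comfortably exceeds the timer's resolution.
#include <chrono>

template <typename Tree>
double time_trace_seconds(const Trace& trace, Tree& tree, int repetitions) {
    using clock = std::chrono::steady_clock;

    auto start = clock::now();
    for (int i = 0; i < repetitions; ++i)
        replay(trace, tree);
    auto stop = clock::now();

    // Report the average time per execution of the trace.
    std::chrono::duration<double> total = stop - start;
    return total.count() / repetitions;
}
```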
Another potential pitfall is providing the same execution environment each time. Between trace runs, at the very least, you might consider techniques for ensuring that you start with a clean cache. Either that, or don't time the first execution, or understand that this result might be culled when you eliminate outliers. It might be safer to just reset the cache (for instance, by touching every element of some large array between trace executions), since code A might benefit from having some of the values in cache while code B might suffer.
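One way to do that cache reset, as a rough sketch; the 64 MiB buffer size is an assumption and only needs to be much larger than the last-level cache:

```cpp
// A sketch of scrubbing the cache between runs by touching every element of
// a large array, as suggested above.
#include <vector>

void scrub_cache() {
    static std::vector<char> junk(64 * 1024 * 1024, 1);
    volatile char sink = 0;
    for (char c : junk) sink = sink + c;  // force every cache line to be read
    (void)sink;
}
```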
These are a few of the things you might consider when doing your own performance evaluation. Other tools - like PAPI and other profilers, for instance - can measure certain events - cache hits/misses, instructions, etc. - and this information can allow for much richer comparisons than simple comparisons of wall-clock run time.
Measuring CPU time accurately can be very tricky depending on your particular programming language, implementation, etc. For example, with Java's JIT compilation, the results can be extremely different depending on how much you have run the code beforehand!
Can you give more detail about your situation?