How to normalize benchmark results to obtain distribution of ratios correctly?

Posted 2024-12-06 13:30:20


To give a bit of context, I am measuring the performance of virtual machines (VMs), or systems software in general, and usually want to compare different optimizations for a performance problem. Performance is measured as absolute runtime for a number of benchmarks, and usually for a number of VM configurations that vary in the number of CPU cores used, the benchmark parameters, etc. To get reliable results, each configuration is measured about 100 times. Thus, I end up with quite a number of measurements for all kinds of different parameters, and I am usually interested in the speedup for all of them, comparing the VM with and the VM without a certain optimization.

What I currently do is pick one specific series of measurements. Let's say the measurements for a VM without and with the optimization (VM-norm/VM-opt) running benchmark A on 1 core.

Since I want to compare the results across different benchmarks and numbers of cores, I cannot use absolute runtime, but need to normalize it somehow. Thus, I pair up the 100 measurements for benchmark A on 1 core for VM-norm with the corresponding 100 measurements of VM-opt to calculate the VM-opt/VM-norm ratios.

When I do that, taking the measurements just in the order I got them, I obviously get quite a high variation in my 100 resulting VM-opt/VM-norm ratios. So, I thought, OK, let's assume the variation in my measurements comes from non-deterministic effects, and that the same effects cause variation in the same way for VM-opt and VM-norm. So, naively, it should be OK to sort the measurements before pairing them up. And, as expected, that of course reduces the variation.
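To make the two pairing schemes concrete, here is a minimal R sketch. The runtimes are simulated log-normal values invented purely for illustration (the names `vm_norm` and `vm_opt` and all parameters are assumptions, not real measurements). Note that sorting before pairing divides the k-th smallest VM-opt runtime by the k-th smallest VM-norm runtime, so the result is a set of ratios of empirical quantiles rather than ratios of individual runs.

```r
set.seed(1)

# Hypothetical runtimes in ms; real measurements would replace these.
vm_norm <- rlnorm(100, meanlog = log(100), sdlog = 0.05)
vm_opt  <- rlnorm(100, meanlog = log(90),  sdlog = 0.05)

# Pairing in the order the measurements were taken.
ratio_in_order <- vm_opt / vm_norm

# Pairing by rank: k-th smallest of VM-opt over k-th smallest of VM-norm.
ratio_sorted <- sort(vm_opt) / sort(vm_norm)

# Spread of the ratios under each pairing scheme.
sd(ratio_in_order)
sd(ratio_sorted)

# The geometric mean of the ratios does not depend on the pairing,
# because the mean of log(opt) - log(norm) is the same for any one-to-one pairing.
exp(mean(log(ratio_in_order)))
exp(mean(log(ratio_sorted)))
```

The last two lines suggest that the pairing changes only the spread and shape of the ratio distribution, not its geometric mean, which is one reason the shape obtained from sorted pairs should be interpreted with care.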

However, my half-knowledge tells me that this is not the best way and perhaps not even correct.
Since I am ultimately interested in the distribution of those ratios, to visualize them with beanplots, a colleague suggested using the Cartesian product instead of pairing sorted measurements. That sounds like it would account better for the random nature of two arbitrary measurements paired up for comparison. But I am still wondering what a statistician would suggest for such a problem.
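The Cartesian-product idea can be sketched in R with `outer()`, which divides every VM-opt runtime by every VM-norm runtime. The data below are again simulated and purely illustrative. One caveat worth stating: the resulting 10000 ratios are built from only 200 measurements, so they are not independent observations and should not be treated as a sample of size 10000 in any test or confidence interval.

```r
set.seed(1)

# Hypothetical runtimes in ms; real measurements would replace these.
vm_norm <- rlnorm(100, meanlog = log(100), sdlog = 0.05)
vm_opt  <- rlnorm(100, meanlog = log(90),  sdlog = 0.05)

# ratio_matrix[i, j] is vm_opt[i] / vm_norm[j].
ratio_matrix <- outer(vm_opt, vm_norm, "/")

# Flatten to a plain vector: 100 x 100 = 10000 ratios.
all_ratios <- as.vector(ratio_matrix)
length(all_ratios)

# Summary of the all-pairs ratio distribution.
quantile(all_ratios, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
```

The vector `all_ratios` can then be passed to whatever density, bean, or violin plot is preferred.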

In the end, I am really interested in plotting the distribution of ratios with R as bean or violin plots. Simple boxplots, or just mean+stddev, tell me too little about what is going on. These distributions usually point at artifacts that are produced by the complex interactions on these much too complex computers, and that's what I am interested in.

Any pointers to approaches for how to work with and how to produce such ratios in a correct way are very welcome.

PS: This is a repost, the original was posted at https://stats.stackexchange.com/questions/15947/how-to-normalize-benchmark-results-to-obtain-distribution-of-ratios-correctly


怪我鬧 2024-12-13 13:30:20

I found it puzzling that you got such a minimal response on "Cross Validated". This does not seem like a specific R question, but rather a request for how to design an analysis. Perhaps the audience there thought you were asking too broad a question, but if that is the case then the [R] forum is even worse, since we generally tackle problems where data is actually provided. We deal with requests for how to implement something in our language. I agree that violin plots are preferable to boxplots for examining distributions (when there is sufficient data, and I am not sure that 100 samples per group makes the grade in that instance), but in any case that means the "R answer" is that you just need to refer to the proper R help pages:

library(lattice)
?xyplot
?panel.violin
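As a rough illustration of what those help pages lead to, the following sketch draws violin plots of ratios with lattice by passing `panel.violin` as the panel function of `bwplot()`. The data frame `d`, its column names, and the two configurations are invented for the example, and the ratios are simulated rather than measured.

```r
library(lattice)
set.seed(1)

# Hypothetical VM-opt/VM-norm ratios for two made-up configurations.
d <- data.frame(
  config = factor(rep(c("1 core", "2 cores"), each = 100)),
  ratio  = c(rlnorm(100, meanlog = log(0.9), sdlog = 0.07),
             rlnorm(100, meanlog = log(0.8), sdlog = 0.10))
)

# One violin per configuration; panel.violin replaces the default box-and-whisker panel.
p <- bwplot(ratio ~ config, data = d,
            panel = panel.violin,
            ylab = "VM-opt / VM-norm runtime ratio")
print(p)
```

With real data, `ratio` would hold the ratios computed for each benchmark and core count, and `config` would label the configuration each ratio belongs to.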

Further comments would require more details and preferably some data examples constructed in R. You may want to refer to the page where "great question design is outlined".

One further graphical method: If you are interested in the ratios of two paired variates but do not want to "commit" to just x/y, then you can examine them by plotting one against the other and then adding iso-ratio lines by repeatedly using abline(a=0, b= ). I think 100 samples is pretty "thin" for doing density estimates, but there are 2d density methods if you can gather more data.
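A minimal sketch of the iso-ratio idea, assuming simulated runtimes paired in measurement order and three arbitrarily chosen slopes (0.8, 0.9, 1.0): each call to `abline(a = 0, b = b)` draws a line through the origin on which VM-opt/VM-norm equals `b`, so points below a line have a ratio smaller than that slope.

```r
set.seed(1)

# Hypothetical paired runtimes in ms; real measurements would replace these.
vm_norm <- rlnorm(100, meanlog = log(100), sdlog = 0.05)
vm_opt  <- rlnorm(100, meanlog = log(90),  sdlog = 0.05)

# Scatter plot of the raw pairs instead of their ratios.
plot(vm_norm, vm_opt,
     xlab = "VM-norm runtime (ms)", ylab = "VM-opt runtime (ms)")

# Iso-ratio lines: along each line, vm_opt / vm_norm equals the slope b.
slopes <- c(0.8, 0.9, 1.0)  # illustrative ratios, chosen arbitrarily
for (b in slopes) abline(a = 0, b = b, lty = 2)

# Fraction of pairs falling below each iso-ratio line.
frac_below <- sapply(slopes, function(b) mean(vm_opt < b * vm_norm))
frac_below
```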
