What is the best way to unit test code that generates random output?
Specifically, I've got a method that picks n items from a list in such a way that a% of them meet one criterion, b% meet a second, and so on. A simplified example would be to pick 5 items where 50% have a given property with the value 'true' and 50% 'false'; 50% of the time the method would return 2 true/3 false, and the other 50% of the time, 3 true/2 false.
Statistically speaking, this means that over 100 runs (500 items in total), I should get about 250 true/250 false, but because of the randomness, 240/260 is entirely possible.
What's the best way to unit test this? I'm assuming that even though 300/200 is technically possible, the test should probably fail if it happens. Is there a generally accepted tolerance for cases like this, and if so, how do you determine what it is?
Edit: In the code I'm working on, I don't have the luxury of using a pseudo-random number generator, or a mechanism for forcing the results to balance out over time, because the lists that are picked are generated on different machines. I need to be able to demonstrate that, over time, the average number of items matching each criterion will tend to the required percentage.
Comments (9)
Randomness and statistics are not welcome in unit tests. A unit test should always return the same result. Always. Not mostly.
What you could do is separate the random generator from the logic you are testing. Then you can mock the random generator and have it return predefined values.
Additional thoughts:
You could consider changing the implementation to make it more testable. Try to draw as few random values as possible. For instance, you could draw a single random value that determines the deviation from the average distribution. That would be easy to test: if the random value is zero, you should get exactly the distribution you expect on average; if the value is, say, 1.0, you miss the average by some defined factor, for instance 10%. You could also implement a Gaussian distribution, and so on. I know this is not the topic here, but if you are free to implement it as you like, consider testability.
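A minimal sketch of the mocking idea, assuming the code under test can accept its random source as a parameter. The names `pick_true_count` and `FixedRandom` are made up for illustration and are not from the question:

```python
import random

def pick_true_count(n, fraction, rng=random):
    """Hypothetical helper: how many of n picks should have the property."""
    exact = n * fraction
    low = int(exact)
    # Round up with probability equal to the fractional part of n * fraction.
    return low + (1 if rng.random() < exact - low else 0)

class FixedRandom:
    """Test double whose random() always returns a predefined value."""
    def __init__(self, value):
        self.value = value

    def random(self):
        return self.value

print(pick_true_count(5, 0.5, FixedRandom(0.0)))   # 3
print(pick_true_count(5, 0.5, FixedRandom(0.99)))  # 2
```

With the generator stubbed out, each branch of the logic gets an exact, repeatable assertion.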
Based on the statistical information you have, assert that the result falls within a range rather than equals one particular value.
Many probabilistic algorithms, e.g. in scientific computing, use a pseudo-random number generator instead of a true random number generator. Even though it is not truly random, a carefully chosen pseudo-random number generator will do the job just fine.
One advantage of a pseudo-random number generator is that the sequence it produces is fully reproducible. Since the algorithm is deterministic, the same seed always generates the same sequence. This is often the deciding factor in choosing one in the first place, because experiments need to be repeatable and results reproducible.
This concept also applies to testing. Components can be designed so that you can plug in any source of random numbers. For testing, you can then use generators that are consistently seeded. The result is repeatable, which makes it suitable for testing.
Note that if true random numbers are in fact needed, you can still test this way, as long as the component has a pluggable source of random numbers. You can plug the same recorded sequence (which may be truly random if need be) back into the same component for testing.
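Roughly what a pluggable source could look like in Python. `pick_flags` and `ReplayRandom` are hypothetical names, and the component is assumed to call only `random()` on whatever source it is given:

```python
import random

def pick_flags(n, p, rng):
    """Hypothetical component: n booleans, each true with probability p."""
    return [rng.random() < p for _ in range(n)]

class ReplayRandom:
    """Replays a previously recorded sequence of random values."""
    def __init__(self, values):
        self._values = iter(values)

    def random(self):
        return next(self._values)

# Same seed, same sequence: the result is repeatable.
seeded_a = pick_flags(5, 0.5, random.Random(42))
seeded_b = pick_flags(5, 0.5, random.Random(42))
assert seeded_a == seeded_b

# Record values from the OS entropy source once, then replay them as often as needed.
recorded = [random.SystemRandom().random() for _ in range(5)]
assert pick_flags(5, 0.5, ReplayRandom(recorded)) == pick_flags(5, 0.5, ReplayRandom(recorded))
```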
It seems to me there are at least three distinct things you want to test here:
1. The procedure that turns random values into a result
2. The distribution of the random source
3. The distribution of the results
1 should be deterministic, and you can unit test it by supplying a chosen set of known "random" values and inputs and checking that it produces the known correct outputs. This is easiest if you structure the code so that the random source is passed in as an argument rather than embedded in the code.
2 and 3 cannot be tested absolutely. You can test to some chosen confidence level, but you must be prepared for such tests to fail in some fraction of cases. Probably the thing you really want to look out for is test 3 failing much more often than test 2, since that would suggest that your algorithm is wrong.
The tests to apply depend on the expected distribution. For 2 you most likely expect the random source to be uniformly distributed. There are various tests for this, depending on how involved you want to get; see for example Tests for pseudo-random number generators on this page.
The expected distribution for 3 depends very much on exactly what you're producing. The simple 50-50 case in the question is exactly equivalent to testing for a fair coin, but other cases will obviously be more complicated. If you can work out what the distribution should be, a chi-square test against it may help.
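For the 50-50 case, a chi-square check might look something like the sketch below. The function names are mine; 10.828 is the standard critical value for one degree of freedom at a 0.1% significance level, and that level is an arbitrary choice. The statistic is computed by hand so nothing beyond plain Python is needed:

```python
# Chi-square critical value for 1 degree of freedom at significance 0.001.
CRITICAL_DF1_P001 = 10.828

def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def matches_expected(observed, expected, critical=CRITICAL_DF1_P001):
    """True if the observed counts are consistent with the expected counts."""
    return chi_square_statistic(observed, expected) <= critical

print(matches_expected([240, 260], [250, 250]))  # True  (statistic 0.8)
print(matches_expected([300, 200], [250, 250]))  # False (statistic 20.0)
```

With more than two categories the degrees of freedom change, so the critical value has to be looked up again.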
That depends on how you use your test suite. If you run it every few seconds because you embrace test-driven development and aggressive refactoring, then it is very important that it doesn't fail spuriously, because that causes major disruption and lowers productivity. In that case, choose a threshold that a well-behaved implementation is practically never going to reach. If you run your tests once a night and have some time to investigate failures, you can be much stricter.
Under no circumstances should you deploy something that leads to frequent uninvestigated failures. That defeats the entire purpose of having a test suite and dramatically reduces its value to the team.
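One way to turn "practically impossible" into a number, assuming each item is an independent draw with probability p (which may not hold for your picker): compute the exact binomial probability of a spurious failure for a given tolerance, then take the smallest tolerance that stays under the failure rate you can live with. Function names are illustrative:

```python
from math import comb

def false_failure_probability(n, p, tolerance):
    """P(|X - n*p| > tolerance) for X ~ Binomial(n, p), summed exactly."""
    mean = n * p
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k - mean) > tolerance
    )

def smallest_tolerance(n, p, max_failure_rate):
    """Smallest integer tolerance whose spurious-failure rate is acceptable."""
    for tolerance in range(n + 1):
        if false_failure_probability(n, p, tolerance) <= max_failure_rate:
            return tolerance

# A suite run every few seconds wants a far rarer false alarm than a nightly run.
print(smallest_tolerance(500, 0.5, 1e-9))
print(smallest_tolerance(500, 0.5, 0.05))
```

The float arithmetic here is fine for n in the hundreds; much larger n would need logarithms or a statistics library.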
You should test the distribution of results in a "single" unit test, i.e. check that in any individual run the result is as close to the desired distribution as possible. For your example, 2 true / 3 false is OK as a result; 4 true / 1 false is not.
You could also write tests that execute the method, say, 100 times and check that the average of the distributions is "close enough" to the desired rate. This is a borderline case: running bigger batches may take a significant amount of time, so you might want to run these tests separately from your "regular" unit tests. Also, as Stefan Steinegger points out, such a test will fail every now and then if you define "close enough" too strictly, and becomes meaningless if you define the threshold too loosely. So it is a tricky case...
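The per-run check can be made fully deterministic, because it asserts an invariant that must hold for every random sequence. A sketch, with `pick_flags` as a hypothetical stand-in for the real method:

```python
import random

def pick_flags(n, fraction, rng):
    """Hypothetical stand-in for the method under test: returns n booleans."""
    exact = n * fraction
    true_count = int(exact) + (1 if rng.random() < exact - int(exact) else 0)
    flags = [True] * true_count + [False] * (n - true_count)
    rng.shuffle(flags)
    return flags

def test_every_run_is_as_close_as_possible(runs=1000):
    rng = random.Random()  # unseeded on purpose: the invariant holds for any sequence
    for _ in range(runs):
        trues = sum(pick_flags(5, 0.5, rng))
        # 2.5 is the ideal count, so only 2 or 3 are acceptable in a single run.
        assert trues in (2, 3), f"got {trues} true out of 5"

test_every_run_is_as_close_as_possible()
```

This test never fails spuriously, so it can stay in the regular suite while the batch test runs separately.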
I think if I had the same problem, and I had some statistics such as the mean and standard deviation, I would probably construct a confidence interval to detect anomalies. So in your case, if the expected value is 250, create a 95% confidence interval around that mean using a normal distribution. If the result falls outside that interval, you fail the test.
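Worked through for the numbers in the question, assuming 500 independent items with p = 0.5, so the standard deviation is sqrt(500 * 0.5 * 0.5), roughly 11.2. The function name is illustrative and z = 1.96 is the usual two-sided 95% value:

```python
from math import sqrt

def normal_interval(n, p, z=1.96):
    """Approximate interval for a Binomial(n, p) count using the normal curve."""
    mean = n * p
    half_width = z * sqrt(n * p * (1 - p))
    return mean - half_width, mean + half_width

low, high = normal_interval(500, 0.5)
print(round(low, 1), round(high, 1))  # 228.1 271.9
print(low <= 240 <= high)             # True
print(low <= 300 <= high)             # False
```

Keep in mind that a 95% interval rejects a correct implementation about one run in twenty; a larger z trades sensitivity for fewer false alarms.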
Why not refactor the random number generation code out and let both the unit test framework and the production code use it? You are trying to test your algorithm and not the randomized sequence, right?
First you have to know what distribution should result from your random number generation process. In your case you are generating a result that is either 0 or 1, each with probability 0.5. This describes a binomial distribution with p=0.5.
Given a sample size of n, you can construct (as an earlier poster suggested) a confidence interval around the mean. You can also make various statements about probabilities, for instance the probability of getting 240 or fewer of either outcome when n=500.
You can use a normal approximation for values of n greater than 20, as long as p is not very large or very small. The Wikipedia article has more on this.
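To see how good the normal approximation is for the n=500 statement above, the exact binomial tail can be compared with the approximation. This is a sketch with names of my own choosing; the half-unit continuity correction is a standard refinement that the answer does not mention:

```python
from math import comb, erf, sqrt

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_cdf(x, mean, sd):
    """P(Y <= x) for a normal variable with the given mean and standard deviation."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

n, p = 500, 0.5
exact = binomial_cdf(240, n, p)
# Continuity correction: treat "240 or fewer" as "below 240.5".
approx = normal_cdf(240.5, n * p, sqrt(n * p * (1 - p)))
print(exact, approx)
```

The two values should agree closely here; the agreement gets worse as p moves toward 0 or 1 or as n gets small.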