What does Simpson's paradox mean in A/B testing?
I am running an A/B test and I am facing Simpson's paradox in my results: the conclusion changes depending on whether I look at the data by day, by month, or over the total duration of the test.
- Does it mean that my A/B test is not correct or not representative? (Did some external factor affect the test?)
- If it is a sign of a problem, what directions should I follow?
Thanks for your great help.
Further reading: http://en.wikipedia.org/wiki/Simpson%27s_paradox
Comments (3)
It's a little difficult to say without seeing the exact data and the dimensions you are testing, but generally speaking you want to make decisions based on the uncombined data. This article from Microsoft gives a pretty clear example of Simpson's paradox in software testing.
Can you provide a clean example of your combined and uncombined data and a brief summary of the test?
If A is clearly, significantly better in individual A/B tests, while B scores better in aggregate, then the main implication is that you can't aggregate those data sets that way. A is better.
If the testing got the same results every day, you wouldn't get this clear result, even with varying sample sizes per day. So I think it additionally implies that something has changed. It could be anything, though. Maybe what you tested each day changed (perhaps in some very subtle way, like server speed). Or maybe the people you're testing it on changed (perhaps demographically, perhaps just in terms of their mood). That doesn't mean your testing is bad or invalid. It just means you're measuring something that's moving, and that makes things tricky.
And I might be miscalculating or misunderstanding the situation, but I think it is also necessarily true that you haven't been testing A and B the same number of times. That is, if on Monday you tested A 50 times and B 50 times, and on Tuesday you tested A 600 times and B 600 times, and so on, and A outscored B each day, then I don't see how you could get an aggregate result where B beats A. If this is true of your test setup, it certainly seems like something you could fix to make your data easier to reason about.
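To make the point about sample sizes concrete, here is a minimal Python sketch. All the counts are made up for illustration (they are not the asker's data), and the names `UNEQUAL_SPLIT`, `EQUAL_SPLIT`, `winner` and `daily_and_pooled_winners` are hypothetical. It compares A and B within each day and then on the pooled counts, once with an uneven A/B split per day and once with the same daily conversion rates but an even split.

```python
from fractions import Fraction

# Hypothetical (conversions, trials) per variant per day; not real test data.
# The A/B split is very uneven and flips between the two days.
UNEQUAL_SPLIT = {
    "Monday":  {"A": (90, 100),   "B": (800, 1000)},
    "Tuesday": {"A": (300, 1000), "B": (20, 100)},
}
# Same daily conversion rates, but A and B get the same number of trials each day.
EQUAL_SPLIT = {
    "Monday":  {"A": (90, 100),   "B": (80, 100)},
    "Tuesday": {"A": (300, 1000), "B": (200, 1000)},
}


def winner(a, b):
    """Return the variant with the higher conversion rate, or "tie"."""
    rate_a = Fraction(a[0], a[1])
    rate_b = Fraction(b[0], b[1])
    if rate_a == rate_b:
        return "tie"
    return "A" if rate_a > rate_b else "B"


def daily_and_pooled_winners(days):
    """Compare A and B within each day, then on the counts summed over all days."""
    daily = {day: winner(v["A"], v["B"]) for day, v in days.items()}
    pooled = {
        variant: (
            sum(v[variant][0] for v in days.values()),
            sum(v[variant][1] for v in days.values()),
        )
        for variant in ("A", "B")
    }
    return daily, winner(pooled["A"], pooled["B"])


for name, days in [("unequal split", UNEQUAL_SPLIT), ("equal split", EQUAL_SPLIT)]:
    daily, overall = daily_and_pooled_winners(days)
    print(name, "| daily winners:", daily, "| pooled winner:", overall)
```

With these made-up numbers, A wins on both days in both setups, but the pooled comparison favors B only when the split is uneven. With an even split each day, the pooled result agrees with the daily ones.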
Simpson's paradox only occurs when your group sizes are different. The final result is actually a weighted average of the results from each group, and the paradox can arise from this weighting.
It's not actually caused by external factors. It's simply because one group carries much more weight (because it has more elements).
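The weighted-average statement can be checked with a short Python sketch. The counts below are hypothetical, and `pooled_rate` and `weighted_average_rate` are illustrative names rather than any library's API. For each variant, the rate computed from pooled counts equals the average of the per-group rates, weighted by each group's share of the trials.

```python
from fractions import Fraction

# Hypothetical (conversions, trials) per group for each variant; not real test data.
GROUPS_A = [(90, 100), (300, 1000)]
GROUPS_B = [(800, 1000), (20, 100)]


def pooled_rate(groups):
    """Conversion rate computed from counts summed over all groups."""
    total_conversions = sum(c for c, _ in groups)
    total_trials = sum(n for _, n in groups)
    return Fraction(total_conversions, total_trials)


def weighted_average_rate(groups):
    """Per-group rates averaged with weights n_i / N (each group's share of trials)."""
    total_trials = sum(n for _, n in groups)
    return sum(Fraction(n, total_trials) * Fraction(c, n) for c, n in groups)


for name, groups in [("A", GROUPS_A), ("B", GROUPS_B)]:
    total_trials = sum(n for _, n in groups)
    for conversions, trials in groups:
        # Show each group's own rate and the weight it gets in the pooled figure.
        print(name, "group rate:", float(Fraction(conversions, trials)),
              "weight:", float(Fraction(trials, total_trials)))
    print(name, "pooled rate:", float(pooled_rate(groups)))
    # The two ways of computing the overall rate are algebraically identical.
    assert pooled_rate(groups) == weighted_average_rate(groups)
```

In this made-up example, most of A's weight (10/11 of its trials) sits on its low-rate group, while most of B's weight sits on its high-rate group. That is why B's pooled rate ends up higher even though A's rate is higher within each group.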
If you provide some more info, we could probably help better.