Select random rows from MySQL (with probability)
I have a MySQL table with a column called cur_odds, which holds the probability (as a fraction between 0 and 1) that the row should be selected. How do I write a query that actually selects rows at approximately that frequency, for example over 100 runs of the query?
I tried the following, but a row with a probability of 0.35 ends up being selected around 60-70% of the time.
SELECT * FROM table ORDER BY RAND()*cur_odds DESC
The values of cur_odds in the table add up to exactly 1.
Comments (2)
If cur_odds changes rarely, you could implement the following algorithm:

1) Create another column, prob_sum, that holds the running total of cur_odds: each row's prob_sum is the previous row's prob_sum plus its own cur_odds.
2) Generate a random number from 0 to 1 (call it rnd).
3) Find the first row for which prob_sum > rnd. If you create a BTREE index on prob_sum, the query should run much faster.
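As a rough illustration of the cumulative-sum approach (a prob_sum column, a random number rnd, and a lookup of the first row with prob_sum > rnd), here is a minimal Python sketch that uses an in-memory SQLite table rather than MySQL. The table name items, the id column, the index name, and the helpers build_table and pick are all assumptions made for this example. The lookup query itself should carry over to MySQL, with RAND() supplying rnd.

```python
import random
import sqlite3


def build_table(odds):
    """Create an in-memory table (hypothetical name: items) whose prob_sum
    column holds the running total of cur_odds."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, cur_odds REAL, prob_sum REAL)"
    )
    running = 0.0
    for row_id, p in enumerate(odds, start=1):
        running += p
        conn.execute(
            "INSERT INTO items (id, cur_odds, prob_sum) VALUES (?, ?, ?)",
            (row_id, p, running),
        )
    # An index on prob_sum lets the lookup below avoid a full table scan.
    conn.execute("CREATE INDEX items_prob_sum ON items (prob_sum)")
    return conn


def pick(conn, rnd):
    """Return (id, cur_odds) of the first row whose prob_sum exceeds rnd."""
    row = conn.execute(
        "SELECT id, cur_odds FROM items WHERE prob_sum > ? "
        "ORDER BY prob_sum LIMIT 1",
        (rnd,),
    ).fetchone()
    if row is None:
        # Floating-point rounding can leave the last prob_sum slightly
        # below rnd; fall back to the last row in that case.
        row = conn.execute(
            "SELECT id, cur_odds FROM items ORDER BY prob_sum DESC LIMIT 1"
        ).fetchone()
    return row


if __name__ == "__main__":
    conn = build_table([0.35, 0.25, 0.4])
    print(pick(conn, random.random()))
```

Each row owns a slice of the interval from 0 to 1 whose width equals its cur_odds, so a uniform rnd lands in that slice with exactly that probability. The trade-off is that prob_sum has to be recomputed whenever cur_odds changes, which is why the approach suits values that change rarely.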
Given your SQL statement above, the numbers in cur_odds are not the probabilities that each row is selected. They are just arbitrary weights (relative to the weights of all the other rows), best interpreted as a relative tendency to float toward the top of the sorted table. The actual value in each row is meaningless: you could have 4 rows with values of 0.35, 0.5, 0.75 and 0.99, or values of 35, 50, 75 and 99, and the results would be the same.

Update: Here's what's going on with your query. You have one row with a cur_odds value of 0.35. For the sake of illustration, I'm going to assume that the other 9 rows all have the same value (0.072). Also for the sake of illustration, let's assume RAND() returns a value from 0.0 to 1.0 (in MySQL it returns a value v with 0 <= v < 1.0).

Every time you run this SELECT statement, each row is assigned a sorting value by multiplying its cur_odds value by a RAND() value from 0.0 to 1.0. This means that the row with 0.35 will have a sorting value between 0.0 and 0.35.

Every other row (with a value of 0.072) will have a sorting value between 0.0 and 0.072. This means there is an approximately 80% chance (1 - 0.072/0.35 ≈ 0.79) that your one row will have a sorting value greater than 0.072, in which case no other row can possibly be sorted higher. This is why your row with the cur_odds value of 0.35 comes up first more often than you expect.

I incorrectly described the cur_odds value as a relative weighting. It actually functions as a maximum relative weighting, and working out the actual probabilities involved takes some more complex math.

I'm not sure what you need can be done with straight SQL. I've implemented a weighted probability picker many times (I was even going to ask a question about best methods for this this morning, ironically), but always in code.
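To make the effect described above concrete, here is a small Python simulation. It is a sketch under the same illustrative assumptions (one row at 0.35, nine rows at 0.072), not a model of the asker's real table. The names ODDS, sorted_first_share and weighted_share are made up for the example. The first function mimics ORDER BY RAND()*cur_odds DESC and counts how often the 0.35 row sorts first. The second does the weighted pick in code with the standard library's random.choices, for comparison.

```python
import random

# Assumed illustration data: one row at 0.35, nine rows at 0.072.
ODDS = [0.35] + [0.072] * 9


def sorted_first_share(odds, target=0, trials=20000, seed=1):
    """Fraction of trials in which odds[target] gets the largest sort key
    when every row's key is a fresh uniform random number times its odds."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        keys = [rng.random() * p for p in odds]
        if keys.index(max(keys)) == target:
            wins += 1
    return wins / trials


def weighted_share(odds, target=0, trials=20000, seed=1):
    """Fraction of trials in which odds[target] is drawn by a proper
    weighted pick (random.choices normalizes the weights itself)."""
    rng = random.Random(seed)
    picks = rng.choices(range(len(odds)), weights=odds, k=trials)
    return picks.count(target) / trials


if __name__ == "__main__":
    print("sorted by RAND()*cur_odds:", sorted_first_share(ODDS))
    print("weighted pick in code:    ", weighted_share(ODDS))
```

Under these assumptions the sort-based share should come out at roughly 0.81: about 0.79 from the cases where the key exceeds 0.072, plus a one-in-ten share of the remaining cases, where all ten keys compete on equal terms. The weighted pick should stay close to 0.35.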