Select a random row from MySQL (with probability)

Posted on 2024-08-30 00:01:42

I have a MySQL table with a column called cur_odds, a fractional number giving the probability that the row should be selected. How do I write a query that actually selects rows at approximately those frequencies when it is run, say, 100 times?

I tried the following, but a row that has a probability of 0.35 ends up getting selected around 60-70% of the time.

SELECT * FROM table ORDER BY RAND()*cur_odds DESC

All the values of cur_odds in the table add up to 1 exactly.

Comments (2)

荭秂 2024-09-06 00:01:42

If cur_odds changes rarely, you could implement the following algorithm:

1) Create another column prob_sum, for which

prob_sum[0] := cur_odds[0]

for 1 <= i <= row_count - 1:

prob_sum[i] := prob_sum[i - 1] + cur_odds[i]

2) Generate a random number from 0 to 1:

rnd := rand(0,1)

3) Find the first row for which prob_sum > rnd (if you create a BTREE index on the prob_sum, the query should work much faster):

CREATE INDEX prob_sum_ind ON <table> (prob_sum);

SET @rnd := RAND();

SELECT * FROM <table> WHERE prob_sum > @rnd ORDER BY prob_sum LIMIT 1;
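
The three steps above can also be tried outside the database. Below is a minimal Python sketch, assuming a hypothetical four-row table whose cur_odds sum to 1; the row ids and the pick function are made up for illustration. The running totals play the role of the prob_sum column, and a binary search stands in for the indexed lookup.

```python
import bisect
import itertools
import random

# Hypothetical example data: (row id, cur_odds); the odds sum to 1.
rows = [("a", 0.35), ("b", 0.25), ("c", 0.25), ("d", 0.15)]

# Step 1: running totals, the in-memory analogue of the prob_sum column.
prob_sum = list(itertools.accumulate(odds for _, odds in rows))


def pick(rng=random):
    """Return the id of the first row whose prob_sum exceeds a random number."""
    rnd = rng.random()  # step 2: uniform in [0.0, 1.0), like MySQL's RAND()
    # Step 3: bisect_right returns the first index with prob_sum[index] > rnd.
    index = bisect.bisect_right(prob_sum, rnd)
    # Guard against the last running total rounding to slightly below 1.0.
    index = min(index, len(rows) - 1)
    return rows[index][0]


if __name__ == "__main__":
    rng = random.Random(42)
    trials = 100_000
    counts = {row_id: 0 for row_id, _ in rows}
    for _ in range(trials):
        counts[pick(rng)] += 1
    for row_id, odds in rows:
        print(row_id, odds, counts[row_id] / trials)
```

Over many trials the observed share of each row should settle close to its cur_odds. The rounding guard matters in SQL too: if the stored totals end slightly below 1, a random number above the last total would match no row.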

南…巷孤猫 2024-09-06 00:01:42

Given your SQL statement above, the numbers in cur_odds are not the probabilities that each row is selected. They act as arbitrary weights (relative to the weights of all the other rows), best interpreted as a relative tendency to float towards the top of the sorted table. The actual value in each row is meaningless on its own (e.g. you could have 4 rows with values of 0.35, 0.5, 0.75 and 0.99, or values of 35, 50, 75 and 99, and the results would be the same).

Update: Here's what's going on with your query. You have one row with a cur_odds value of 0.35. For the sake of illustration, I'm going to assume that the other 9 rows all have the same value (0.072). Also for the sake of illustration, let's assume RAND() returns a value from 0.0 to 1.0 (in MySQL it returns a value v with 0 <= v < 1.0).

Every time you run this SELECT statement, each row is assigned a sorting value by multiplying its cur_odds value by a RAND() value from 0.0 to 1.0. This means that the row with a 0.35 will have a sorting value between 0.0 and 0.35.

Every other row (with a value of 0.072) will have a sorting value between 0.0 and 0.072. This means there is an approximately 80% chance (1 - 0.072/0.35) that your one row will have a sorting value greater than 0.072, in which case no other row can be sorted higher. This is why your row with the cur_odds value of 0.35 is coming up first more often than you expect.
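
The effect can be checked with a quick simulation. The Python sketch below mimics ORDER BY RAND()*cur_odds DESC for the illustrative ten-row table (one row at 0.35, nine at 0.072) and counts how often the 0.35 row sorts first; the function names are made up for this example.

```python
import random

# Weights from the illustration above: one row at 0.35, nine rows at 0.072.
weights = [0.35] + [0.072] * 9


def first_row_by_scaled_rand(weights, rng):
    """Mimic ORDER BY RAND()*cur_odds DESC and return the index sorted first."""
    keys = [rng.random() * w for w in weights]
    return max(range(len(weights)), key=keys.__getitem__)


def share_of_wins(index, trials, seed=1):
    """Estimate how often the row at `index` comes out on top."""
    rng = random.Random(seed)
    wins = sum(
        first_row_by_scaled_rand(weights, rng) == index for _ in range(trials)
    )
    return wins / trials


if __name__ == "__main__":
    # Share of runs in which the 0.35 row (index 0) is sorted first.
    print(share_of_wins(0, 100_000))
```

With these assumed weights the 0.35 row should come out first in roughly 80% of runs, far above its intended 35%.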

I incorrectly described the cur_odds value as a relative weighting. It actually functions as a maximum relative weighting, and working out the actual probabilities involved takes some more complex math.
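
For the ten-row illustration specifically, the math stays manageable because the nine other rows share one weight. A sketch, writing U for the 0.35 row's RAND() value and assuming every row gets an independent uniform draw: if the 0.35 row's sorting value exceeds 0.072 it wins outright; otherwise its sorting value is uniform on the same range as the other nine, so by symmetry it wins one time in ten.

```latex
\begin{align*}
a &= \frac{0.072}{0.35} \approx 0.206 \\
P(\text{0.35 row first})
  &= P(0.35\,U > 0.072) + P(0.35\,U \le 0.072)\cdot\frac{1}{10} \\
  &= (1 - a) + \frac{a}{10} \approx 0.794 + 0.021 \approx 0.815
\end{align*}
```

With unequal weights among the other rows there is no such shortcut, and the probability becomes an integral over the 0.35 row's sorting value.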

I'm not sure what you need can be done in plain SQL. I've implemented a weighted probability picker many times (ironically, I was even going to ask a question about best methods for this this morning), but always in application code.
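
One way it might be done with a sort key alone is to replace RAND()*cur_odds with an exponential key. This is a different technique from either answer here: give each row the key -ln(1 - U) / weight and take the smallest. The minimum of independent exponential variables falls on a given row with probability proportional to its rate, so each row wins in proportion to its weight. Below is a minimal Python sketch with hypothetical weights; the SQL form mentioned in the docstring is an untested assumption.

```python
import math
import random

# Hypothetical weights; they need not sum to 1, but each must be positive.
weights = {"a": 0.35, "b": 0.25, "c": 0.25, "d": 0.15}


def pick_by_exponential_key(weights, rng):
    """Give each row the key -ln(1 - U) / weight; return the smallest key's row.

    Roughly the in-code analogue of
    ORDER BY -LOG(1 - RAND()) / cur_odds LIMIT 1 (an assumption, not tested here).
    """
    def key(item):
        _, weight = item
        # rng.random() is in [0.0, 1.0), so the log argument is never zero.
        return -math.log(1.0 - rng.random()) / weight

    return min(weights.items(), key=key)[0]


def frequencies(trials, seed=7):
    """Observed selection share per row id over `trials` picks."""
    rng = random.Random(seed)
    counts = dict.fromkeys(weights, 0)
    for _ in range(trials):
        counts[pick_by_exponential_key(weights, rng)] += 1
    return {row_id: count / trials for row_id, count in counts.items()}


if __name__ == "__main__":
    print(frequencies(100_000))
```

Unlike the prob_sum approach, this needs no extra column, but it computes a key for every row on each pick, so it suits small tables better than large ones.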
