My(SQL) selecting random rows, novel way, help evaluating how good it is?
I again ran into the problem of selecting a random subset of rows. As many probably know, ORDER BY RAND() is quite inefficient, or at least that's the consensus. I have read that MySQL generates a random value for every row in the table, then filters, then orders by these random values, and then returns the set. The biggest performance impact seems to come from the fact that there are as many random numbers generated as there are rows in the table. So I was looking for a possibly better way to return a random subset of the results of a query such as:
SELECT id FROM <table> WHERE <some conditions> LIMIT 10;
Of course, the simplest and easiest way to do what I want would be the one which I am trying to avoid:
SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10; (a)
Now, after some thinking, I came up with another option for this task:
SELECT id FROM <table> WHERE (<some conditions>) AND RAND() > x LIMIT 10; (b)
(Of course we can use < instead of >.) Here we take x from the range 0.0 - 1.0. Now, I'm not exactly sure how MySQL handles this, but my guess is that it first selects the rows matching <some conditions> (using index[es]?), and then generates a random value per row and sees whether it should return or discard that row. But what do I know :) that's why I'm asking here. So, some observations about this method:
- First, it does not guarantee that the requested number of rows will be returned, even if there are many more matching rows than needed; this is especially true for values of x close to 1.0 (or close to 0.0 if we use <).
- The returned rows don't really have a random ordering; they are just rows selected at random, ordered by whatever index was used or by the way they are stored (?) (of course, in some cases this might not matter at all).
- We probably need to choose x according to the size of the result set: if we have a large result set and x is, let's say, 0.1, then most of the time only some of the first rows will be returned; on the other hand, if the result set is small and we choose a large x, it is likely that we get fewer rows than we want, even though there are enough of them.
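These observations can be checked with a small simulation outside the database. This is only a sketch of my guess at the execution model: it assumes the server scans matching rows in storage/index order, evaluates RAND() once per row, and stops as soon as LIMIT rows have passed the filter.

```python
import random

def simulate_rand_filter(n_matching, x, limit, seed=None):
    """Simulate SELECT ... WHERE RAND() > x LIMIT `limit` over rows scanned
    in order: each row passes with probability 1 - x, and the scan stops
    once `limit` rows have passed (or the matching rows run out)."""
    rng = random.Random(seed)
    picked = []
    for row in range(n_matching):
        if rng.random() > x:
            picked.append(row)
            if len(picked) == limit:
                break
    return picked

# ~2.5k matching rows, as in the test setup described below.
small_x = simulate_rand_filter(2500, 0.1, 10, seed=42)
large_x = simulate_rand_filter(2500, 0.996, 10, seed=42)

# With x = 0.1 almost every row passes, so the "random" sample is really
# just the first dozen or so rows of the scan order.
print(small_x)
# With x very close to 1.0 the sample spreads over the whole range,
# but it may contain fewer than `limit` rows.
print(len(large_x), large_x)
```

This reproduces both caveats at once: a small x clusters the result at the front of the scan, and a large x risks coming up short.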
Now some words about performance. I did a little testing using JMeter. <table> has about 20k rows, and there are about 2-3k rows matching <some conditions>. I wrote a simple PHP script that executes the query and print_r's the result. Then I set up a JMeter test that starts 200 threads, so that is 200 requests per second, each requesting said PHP script. I ran it until about 3k requests were done (the average response time stabilizes well before that). I also executed all queries with SQL_NO_CACHE to prevent the query cache from having any effect. The average response times were:
- ~30ms for query (a)
- 13-15ms for query (b) with x = 0.1
- 17-20ms for query (b) with x = 0.9; as expected, a larger x is slower, since more rows have to be discarded
So my questions are: what do you think about this method of selecting random rows? Maybe you have used it or tried it and found that it did not work out? Maybe you can better explain how MySQL handles such a query? What are some caveats that I'm not aware of?
EDIT: I probably was not clear enough: the point is that I need random rows of the query, not simply of the table, which is why I included <some conditions>, and there are quite a few of them. Moreover, the table is guaranteed to have gaps in id (not that it matters much, since this is about random rows from the query, not from the table), and there will be quite a lot of such queries, so suggestions that involve querying the table multiple times do not sound appealing. <some conditions> will vary at least a bit between requests, meaning that there will be requests with different conditions.
Comments (3)
From my own experience, I've always used ORDER BY RAND(), but this has its own performance implications on larger datasets. For example, if you have a table that is too big to fit in memory, MySQL will create a temporary table on disk and then perform a filesort to randomise the dataset (storage engine permitting). Your LIMIT 10 clause will have no effect on the execution time of the query AFAIK, but it will obviously reduce the size of the data sent back to the client. Basically, the limit and order by happen after the query has been executed (a full table scan to find matching records, which are then ordered and then limited). Any rows outside of your LIMIT 10 clause are discarded.
As a side note, adding SQL_NO_CACHE will disable MySQL's internal query cache, but it does not prevent your operating system from caching the data (due to the random nature of this query, I don't think it would have much of an effect on your execution time anyway).
Hopefully someone can correct me here if I have made any mistakes, but I believe that is the general idea.
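The point that the sort happens before the limit can be illustrated in miniature by inspecting a query plan. This sketch uses SQLite rather than MySQL (my assumption is that the plan shape is analogous; MySQL's EXPLAIN would typically report "Using temporary; Using filesort" for the same kind of query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)", [("x",)] * 1000)

# ORDER BY RANDOM() cannot use any index, so the engine materialises and
# sorts the whole result set before LIMIT trims it: SQLite reports a
# temp b-tree being built for the ORDER BY.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM t ORDER BY RANDOM() LIMIT 10"
).fetchall()
for row in plan:
    print(row[-1])   # the human-readable "detail" column of the plan
```

The LIMIT does not appear anywhere in the sort step, which matches the claim that it only trims the already-sorted result.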
An alternative way, which probably would not be faster, but might be, who knows :)
Either use a table status query to determine the next auto_increment or the row count, or use SELECT count(*). Then you can decide ahead of time the auto_increment value of a random item, and then select that unique item.
This will fail if there are gaps in the auto_increment field, but if it is faster than your other methods you could loop a few times, or fall back to a failsafe method in the case of zero rows returned. The best case might be a big saving; the worst case would be comparable to your current method.
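The guess-then-retry idea with a failsafe can be sketched like this (SQLite stands in for MySQL here, and the table/column names are made up for illustration):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, val TEXT)")
# Deliberately leave gaps in id: every third id is missing.
conn.executemany("INSERT INTO items (id, val) VALUES (?, ?)",
                 [(i, "v%d" % i) for i in range(1, 101) if i % 3 != 0])

def random_item(conn, max_tries=20):
    lo, hi = conn.execute("SELECT min(id), max(id) FROM items").fetchone()
    for _ in range(max_tries):
        guess = random.randint(lo, hi)          # decide the id ahead of time
        row = conn.execute("SELECT id, val FROM items WHERE id = ?",
                           (guess,)).fetchone()  # cheap primary-key lookup
        if row is not None:                      # a miss means we hit a gap
            return row
    # failsafe: fall back to the slow-but-certain method
    return conn.execute(
        "SELECT id, val FROM items ORDER BY RANDOM() LIMIT 1").fetchone()

row = random_item(conn)
print(row)
```

Each attempt is a single indexed lookup, so the happy path is very cheap; the loop bound and the ORDER BY RANDOM() fallback cover the gap case the answer warns about.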
You're using the wrong data structure.
The usual method is something like this:
- Find n — something like SELECT count(id) FROM tablename.
- Pick r distinct randomish numbers in the interval [0,n). I usually recommend an LCG with suitably-chosen parameters, but simply picking r randomish numbers and discarding repeats also works.
- MySQL appears to support indexed lookups with something like SELECT id FROM tablename ORDER BY id LIMIT :i,1, where :i is a bound parameter (I forget what syntax mysqli uses); an alternative syntax is LIMIT 1 OFFSET :i. This means you have to make r queries, but this might be fast enough (it depends on per-statement overheads and on how efficiently it can do OFFSET).
An alternative method, which should work for most databases, is a bit like interval bisection:
SELECT count(id),max(id),min(id) FROM tablename. Then you know rows [0,n-1] have ids [min,max]. More generally, suppose rows [a,b] have ids [min,max] and you want row i. If i == a, return min. If i == b, return max. Otherwise, bisect: split = min+(max-min)/2 (avoiding integer overflow), then run SELECT count(id),max(id) WHERE :min < id AND id < :split and SELECT count(id),min(id) WHERE :split <= id AND id < :max. If the table hasn't been modified, the two counts should sum to b-a-1 (the rows strictly between rows a and b), and they tell you which half row i falls into; recurse on that half. There are plenty of edge cases (I've probably included some off-by-one errors) and a few potential optimizations (you can do this for all the indexes at once, and you don't really need to do two queries per iteration if you don't assume that i == b implies id = max). It's not really worth doing if SELECT ... OFFSET is even vaguely efficient.
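A sketch of the bisection idea, with SQLite standing in for MySQL and a deliberately gappy id column. The row/id bookkeeping below is one particular way of resolving the off-by-ones the answer warns about, not part of the answer itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY)")
# Gappy ids: 1, 8, 15, ..., 197 (29 rows), so row k has id 1 + 7k.
conn.executemany("INSERT INTO tablename (id) VALUES (?)",
                 [(1 + 7 * k,) for k in range(29)])

def nth_id(conn, i):
    """Return the id of row i (0-based, in id order) using only
    count/min/max aggregate queries on the indexed id column."""
    n, lo, hi = conn.execute(
        "SELECT count(id), min(id), max(id) FROM tablename").fetchone()
    a, b = 0, n - 1          # invariant: row a has id lo, row b has id hi
    while True:
        if i == a:
            return lo
        if i == b:
            return hi
        split = lo + (hi - lo) // 2   # midpoint without integer overflow
        cnt, mx = conn.execute(
            "SELECT count(id), max(id) FROM tablename "
            "WHERE ? < id AND id < ?", (lo, split)).fetchone()
        if i <= a + cnt:              # rows (a, a+cnt] hold the ids (lo, split)
            b, hi = a + cnt, mx       # keep the lower anchor (a, lo)
        else:                         # row i lies among the ids in [split, hi)
            cnt2, mn = conn.execute(
                "SELECT count(id), min(id) FROM tablename "
                "WHERE ? <= id AND id < ?", (split, hi)).fetchone()
            a, lo = b - cnt2, mn      # keep the upper anchor (b, hi)

print(nth_id(conn, 0), nth_id(conn, 10), nth_id(conn, 28))  # 1 71 197
```

Note that only one of the two half-queries is needed per iteration here, because each step keeps one end of the [a,b] interval as an anchor; this is essentially the optimization the answer alludes to. To sample r random rows, pick r distinct indices in [0,n) and map each through nth_id.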