用于在 MySQL 中获取随机行的无缝辅助 ID

发布于 2024-11-06 09:14:25 字数 1361 浏览 0 评论 0原文

在我从事的几个项目中,我遇到过需要从大型(> 1M 行)表中获取随机行的情况。对于这么大的表,ORDER BY rand() LIMIT 1 是不可行的,因为它会很快使数据库崩溃。

通常的解决方案是生成 MIN(id)MAX(id) 之间的随机数,然后直接选择该行。但是,如果 id 序列中存在很大的间隙,则需要进行大量重新滚动或使用 WHERE id >= :myrandomnumber 这将导致在较大间隙之后的行获得比平均的。

我一直在考虑通过创建一个仅用于随机化目的的新索引列(例如 id2)来解决此问题。该列始终是 1 和表中行数之间的无间隙序列。

问题:保持该序列无间隙的最佳方法是什么?

我想到的第一个解决方案是创建一个辅助表 recycled_ids,其中包含列 tablenameid2。每当从 tablename 中删除一行时,该行的 id2 就会插入到 recycled_ids 中。插入新行时,会从 recycled_ids 中选择 id2,或者如果没有可用的,则会创建一个新行。有没有更简单的方法?

额外问题:是否有 ORM 或框架已经做到了这一点,或者具有其他有效的随机行选择?这是现有的模式吗?它有名称吗?


更新:我为此编写了一个快速基准测试,并针对一个包含 125,000 行且行间有 30,000 个间隙的表运行它。结果非常有希望:

Fetch a random row 100 times using id2: 0.0234689712524 seconds
Fetch a random row 100 times using ORDER BY rand() LIMIT 1: 54.992347002 seconds

插入测试数据时,我每插入五行就删除一行随机行。该序列始终保持无间隙。

for($i=1; $i<=$amount; $i++) {
    insert_row();
    if($i % 5 == 0)
        delete_random_row();
}

在我的低端虚拟服务器上,使用 $amount = 10000 再次运行该循环需要 9 秒。每行 0.009 秒,其中包括每五次迭代删除一个随机行。随着表的增长,它确实会变慢,但获取随机行却不会。

我原来的问题仍然适用。

In several projects I've been working on I have encountered the need to fetch random rows from large (>1M rows) tables. With tables this large, ORDER BY rand() LIMIT 1 is no option as it will quickly bring the database to it's knees.

The usual solution has been to generat a random number between MIN(id) and MAX(id) and select that row directly. However, if there are big gaps in the id sequence this will require either lots of re-rolls or using WHERE id >= :myrandomnumber which will lead to rows that succeed large gaps getting significantly more hits than average.

I've been thinking to solve this problem by creating a new indexed column solely for randomizing purposes, say id2. This column would always be a gapless sequence between 1 and the number of rows in the table.

Question: What would be the best way to keep this sequence gapless?

The first solution that comes to mind is creating a helper table recycled_ids that will contain columns tablename and id2. Whenever a row is deleted from tablename, the id2 of that row is inserted to recycled_ids. When new rows are inserted, the id2 is either selected from recycled_ids or if none are available, a new one is created. Is there a simpler way?

Bonus questions: Are there ORMs or frameworks that already do this or have otherwise efficient random row selection? Is this an existing pattern, does it have a name?


Update: I wrote a quick benchmark for this and ran it against a table with 125,000 rows and 30,000 gaps between them. The results are pretty promising:

Fetch a random row 100 times using id2: 0.0234689712524 seconds
Fetch a random row 100 times using ORDER BY rand() LIMIT 1: 54.992347002 seconds

When inserting the test data, I removed one random row for every five rows inserted. The sequence stays gapless the whole time.

for($i=1; $i<=$amount; $i++) {
    insert_row();
    if($i % 5 == 0)
        delete_random_row();
}

Running that loop again with $amount = 10000 takes 9 seconds on my low-end vserver. That's 0.009 seconds per row and it includes deleting a random row every five iterations. It does get slower as the table grows, but fetching a random row does not.

My original questions still apply.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

烟酉 2024-11-13 09:14:25

这是我的做法 -

  1. 从表中选择 MAX(id)
  2. 在 PHP(或您使用的任何语言)中,生成 1 和 MAX(id) 之间的随机整数
  3. SELECT * FROM table WHERE id > ;= $random ORDER BY id ASC LIMIT 1
  4. 如果 3 没有返回任何内容,SELECT * FROM table WHERE id $random ORDER BY id DESC LIMIT 1

避免运行任何极其缓慢的查询。它还避免了额外的列,保持无间隙确实是一项令人讨厌的工作!

Here's how I'd do it -

  1. Select the MAX(id) from your table
  2. In PHP (or whatever language you're using), generate a random integer between 1 and MAX(id)
  3. SELECT * FROM table WHERE id >= $random ORDER BY id ASC LIMIT 1
  4. If 3 returns nothing, SELECT * FROM table WHERE id < $random ORDER BY id DESC LIMIT 1

Avoids running any queries that would be brutally slow. It also avoids the extra column which, keeping gapless, would be a nasty job indeed!

追我者格杀勿论 2024-11-13 09:14:25

我想说的是排名可以拯救。

SET @rank:= 1;  

SELECT * FROM
  (
  SELECT @rank:= @rank + 1 as rank, * FROM table1  
  ) s
WHERE s.rank = $random;

Ranking to the rescue I'd say.

SET @rank:= 1;  

SELECT * FROM
  (
  SELECT @rank:= @rank + 1 as rank, * FROM table1  
  ) s
WHERE s.rank = $random;
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文