Gapless auxiliary IDs for fetching random rows in MySQL
In several projects I've been working on, I have encountered the need to fetch random rows from large (>1M rows) tables. With tables this large, ORDER BY rand() LIMIT 1 is not an option, as it will quickly bring the database to its knees.
The usual solution has been to generate a random number between MIN(id) and MAX(id) and select that row directly. However, if there are big gaps in the id sequence, this will require either lots of re-rolls or using WHERE id >= :myrandomnumber, which will lead to rows that follow large gaps getting significantly more hits than average.
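The bias is easy to demonstrate. This Python simulation (illustrative, not from the post) rolls a uniform number between MIN(id) and MAX(id) over a toy id set with one large gap, then counts which row a WHERE id >= :r lookup would return:

```python
import random
from collections import Counter

random.seed(42)
ids = [1, 2, 3, 100]          # id 100 sits behind a gap of 96 missing ids
hits = Counter()
for _ in range(10_000):
    r = random.randint(min(ids), max(ids))     # mimic rolling in MIN(id)..MAX(id)
    chosen = min(i for i in ids if i >= r)     # mimic WHERE id >= r ... LIMIT 1
    hits[chosen] += 1

# id 100 "absorbs" every roll that lands in the gap, so it dominates
assert hits[100] > hits[1] + hits[2] + hits[3]
```

With this id set, roughly 97% of rolls land on id 100, since every value in the gap (4..100) resolves to it.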
I've been thinking of solving this problem by creating a new indexed column solely for randomization purposes, say id2. This column would always be a gapless sequence between 1 and the number of rows in the table.
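With a truly gapless 1..N column, a single uniform roll maps to exactly one existing row, so selection is unbiased and needs no re-rolls. A quick illustrative check in Python (a stand-in for SELECT ... WHERE id2 = :roll):

```python
import random
from collections import Counter

random.seed(0)
n = 4                                   # pretend the table has 4 rows, id2 = 1..4
hits = Counter(random.randint(1, n) for _ in range(40_000))

# every id2 is hit, and none absorbs a disproportionate share
assert set(hits) == {1, 2, 3, 4}
assert max(hits.values()) - min(hits.values()) < 1500
```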
Question: What would be the best way to keep this sequence gapless?
The first solution that comes to mind is creating a helper table recycled_ids with columns tablename and id2. Whenever a row is deleted from tablename, that row's id2 is inserted into recycled_ids. When new rows are inserted, the id2 is either taken from recycled_ids or, if none is available, a new one is created. Is there a simpler way?
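As an in-memory sketch of that scheme (Python stand-ins for the helper table; the class and names are illustrative, not an actual implementation): an allocator that hands out recycled id2 values first and mints fresh ones otherwise.

```python
class GaplessIdAllocator:
    """Toy model of the recycled_ids idea for a single table."""

    def __init__(self):
        self.next_id2 = 1       # next fresh id2 if nothing is recycled
        self.recycled = []      # stand-in for the recycled_ids table

    def allocate(self):
        """Called on INSERT: reuse a recycled id2, or mint a new one."""
        if self.recycled:
            return self.recycled.pop()
        id2 = self.next_id2
        self.next_id2 += 1
        return id2

    def release(self, id2):
        """Called on DELETE: remember the freed id2 for reuse."""
        self.recycled.append(id2)

alloc = GaplessIdAllocator()
a, b, c = alloc.allocate(), alloc.allocate(), alloc.allocate()  # 1, 2, 3
alloc.release(b)                 # row with id2=2 deleted
assert alloc.allocate() == 2     # the hole is refilled before minting 4
assert alloc.allocate() == 4
```

Note that between the delete and the next insert the sequence briefly has a hole, so a random pick can still miss and need one re-roll during that window.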
Bonus questions: Are there ORMs or frameworks that already do this or have otherwise efficient random row selection? Is this an existing pattern, does it have a name?
Update: I wrote a quick benchmark for this and ran it against a table with 125,000 rows and 30,000 gaps between them. The results are pretty promising:
Fetch a random row 100 times using id2: 0.0234689712524 seconds
Fetch a random row 100 times using ORDER BY rand() LIMIT 1: 54.992347002 seconds
When inserting the test data, I removed one random row for every five rows inserted. The sequence stays gapless the whole time.
for($i=1; $i<=$amount; $i++) {
    insert_row();
    if($i % 5 == 0)
        delete_random_row();
}
Running that loop again with $amount = 10000 takes 9 seconds on my low-end vserver. That's 0.0009 seconds per row, and it includes deleting a random row every five iterations. It does get slower as the table grows, but fetching a random row does not.
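One way to keep id2 gapless across deletes (an assumption — the post doesn't show its delete_random_row) is to move the row holding the highest id2 into the freed slot, the same trick used for array-backed sets. A Python simulation of the benchmark's insert/delete pattern, checking the invariant:

```python
import random

random.seed(1)
rows = {}                                  # id2 -> payload; stand-in for the table

def insert_row(payload):
    rows[len(rows) + 1] = payload          # next id2 is always row count + 1

def delete_random_row():
    victim = random.randint(1, len(rows))  # uniform pick, no re-rolls needed
    last = len(rows)
    rows[victim] = rows[last]              # fill the hole with the last row
    del rows[last]

for i in range(1, 1001):                   # same pattern as the PHP benchmark loop
    insert_row(f"row {i}")
    if i % 5 == 0:
        delete_random_row()

# id2 values form an unbroken 1..N sequence after the whole run
assert sorted(rows) == list(range(1, len(rows) + 1))
```

In SQL this would mean one UPDATE plus one DELETE per deletion, at the cost of one row's id2 changing, which is fine for a column used only for randomization.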
My original questions still apply.
2 Answers
Here's how I'd do it -
SELECT * FROM table WHERE id >= $random ORDER BY id ASC LIMIT 1
SELECT * FROM table WHERE id < $random ORDER BY id DESC LIMIT 1
This avoids running any queries that would be brutally slow. It also avoids the extra column; keeping that column gapless would be a nasty job indeed!
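In Python terms, that two-query fallback looks like this (an in-memory simulation, purely illustrative; note it still inherits the gap bias described in the question):

```python
import random

random.seed(7)
ids = sorted(random.sample(range(1, 1_000_000), 1000))   # sparse id column

def fetch_random(ids):
    r = random.randint(ids[0], ids[-1])
    above = [i for i in ids if i >= r]
    if above:
        return above[0]                  # WHERE id >= r ORDER BY id ASC LIMIT 1
    return max(i for i in ids if i < r)  # WHERE id < r ORDER BY id DESC LIMIT 1

picked = {fetch_random(ids) for _ in range(500)}
assert picked <= set(ids)                # every pick lands on an existing row
```

The second query only fires when the roll exceeds every remaining id; with $random capped at MAX(id), the first query always succeeds.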
Ranking to the rescue, I'd say.