Postgres 中的快速随机行选择
我在 postgres 中有一个包含数百万行的表。我在互联网上检查过,发现以下
SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;
内容有效,但速度非常慢...是否有另一种方法来进行该查询,或者直接选择随机行而不读取所有表?顺便说一句,“myid”是一个整数,但它可以是一个空字段。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
您可能想尝试使用
OFFSET
,如N
是mytable
中的行数。您可能需要首先执行SELECT COUNT(*)
来计算出N
的值。更新(作者:Antony Hatchkins)
您必须在此处使用
floor
:考虑一个 2 行的表;
random()*N
生成0 <= x
2
,例如SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;
由于隐式舍入到最接近的 int,因此返回 0 行。You might want to experiment with
OFFSET
, as inThe
N
is the number of rows inmytable
. You may need to first do aSELECT COUNT(*)
to figure out the value ofN
.Update (by Antony Hatchkins)
You must use
floor
here:Consider a table of 2 rows;
random()*N
generates0 <= x < 2
and for exampleSELECT myid FROM mytable OFFSET 1.7 LIMIT 1;
returns 0 rows because of implicit rounding to nearest int.PostgreSQL 9.5 引入了一种更快的样本选择新方法:TABLESAMPLE
语法是
如果您只想选择一行,这不是最佳解决方案,因为您需要知道表的 COUNT 来计算准确的百分比。
为了避免缓慢的 COUNT 并对从 1 行到数十亿行的表使用快速的 TABLESAMPLE,您可以这样做:
这可能看起来不那么优雅,但可能比任何其他答案都快。
要决定是否要使用 BERNULLI 或 SYSTEM,请阅读 https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/
PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE
The syntax is
This is not the optimal solution if you want only one row selected, because you need to know the COUNT of the table to calculate the exact percentage.
To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:
This might not look so elegant, but probably is faster than any of the other answers.
To decide whether you want to use BERNULLI oder SYSTEM, read about the difference at https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/
我用子查询尝试了这个,效果很好。 Offset,至少在 Postgresql v8.4.4 中工作得很好。
I tried this with a subquery and it worked fine. Offset, at least in Postgresql v8.4.4 works fine.
您需要使用
地板
:You need to use
floor
:查看此链接以了解一些不同的选项。
http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/" depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
更新: ( A.Hatchkins)
这篇(非常)长的文章的摘要如下。
作者列出了四种方法:
1)
ORDER BY random() LIMIT 1;
-- 慢2)
ORDER BY id where id>=random()*N LIMIT 1
- - 如果存在间隙,则不均匀3) 随机列 - 需要时不时更新
4) 自定义 随机聚合 -- 狡猾的方法,可能会很慢:random() 需要生成 N 次
,建议使用
5)
ORDER 改进方法 #2 BY id,其中 id=random()*N LIMIT 1
如果结果为空,则进行后续重新查询。
Check this link out for some different options.
http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
Update: (A.Hatchkins)
The summary of the (very) long article is as follows.
The author lists four approaches:
1)
ORDER BY random() LIMIT 1;
-- slow2)
ORDER BY id where id>=random()*N LIMIT 1
-- nonuniform if there're gaps3) random column -- needs to be updated every now and then
4) custom random aggregate -- cunning method, could be slow: random() needs to be generated N times
and suggests to improve method #2 by using
5)
ORDER BY id where id=random()*N LIMIT 1
with subsequent requeries if the result is empty.
获取随机行的最简单、最快的方法是使用 tsm_system_rows 扩展:
然后您可以选择所需的确切行数:
这在 PostgreSQL 9.5 及更高版本中可用。
请参阅:https://www.postgresql.org/docs/current /static/tsm-system-rows.html
The easiest and fastest way to fetch random row is to use the
tsm_system_rows
extension :Then you can select the exact number of rows you want :
This is available with PostgreSQL 9.5 and later.
See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html
我想出了一个非常快速的解决方案,无需
TABLESAMPLE
。比OFFSET random()*N LIMIT 1
快得多。它甚至不需要表数。这个想法是使用随机但可预测的数据创建表达式索引,例如
md5(主键)
。下面是使用 1M 行样本数据进行的测试:
结果:
此查询有时(大约有 1/Number_of_rows 概率)返回 0 行,因此需要检查并重新运行。而且概率也不完全相同 - 有些行比其他行更有可能。
作为比较:
结果差异很大,但可能非常糟糕:
I've came up with a very fast solution without
TABLESAMPLE
. Much faster thanOFFSET random()*N LIMIT 1
. It doesn't even require table count.The idea is to create an expression index with random but predictable data, for example
md5(primary key)
.Here is a test with 1M rows sample data:
Result:
This query can sometimes (with about 1/Number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also probabilities aren't exactly the same - some rows are more probable than others.
For comparison:
Results vary widely, but can be pretty bad:
我向每一行添加了一个随机生成的数字,并用我的编程语言生成了一个添加到每一行的随机数。
调用时,我将一个随机数传递给查询(在本例中为 0.27)
(查询取自 此处)
如果您在条件中的行和随机行(包含随机数)上有一个索引,我将在 6 毫秒内得到 850 万行表的结果。这比使用 order by random() 之类的方法要快几个数量级。
为了提高随机性,您还可以为您命中的每个结果生成一个新的随机数。 (如果没有这个,某些数字将比其他数字更频繁地出现。)
与 TABLESAMPLE 不同,它也支持条件。
I added a randomly generated number to each row and generate a random number in my programming language that is added to each row.
When calling, I pass a random number to the query (in this case 0.27)
(Query taken from here)
If you have an index here on your the rows in your condition and the random row (containing the random numbers), I get a result in 6 ms on my 8.5 million row table. This is orders of magnitude faster than using anything like order by random().
To improve randomness, you can also generate a new random number for each result you have hit. (Without this some number will occur more often than others.)
Unlike TABLESAMPLE this also supports conditions.