Fast random row selection in Postgres

Published 2024-10-21 23:34:15


I have a table in postgres that contains a couple of million rows. I checked on the internet and found the following:

SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;

It works, but it's really slow... Is there another way to write that query, or a direct way to select a random row without reading the whole table? By the way, 'myid' is an integer, but it can be an empty field.


Comments (8)

感情旳空白 2024-10-28 23:34:15


You might want to experiment with OFFSET, as in

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;

Here N is the number of rows in mytable. You may need to run a SELECT COUNT(*) first to figure out the value of N.
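The count-then-offset flow can be sketched end to end. SQLite is used below purely so the example is self-contained (the table and data are made up); the offset is drawn client-side with randrange, which is equivalent to floor(random() * N) in the SQL above:

```python
import random
import sqlite3

# In-memory stand-in for "mytable" with ids 1..100 (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (myid INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(i,) for i in range(1, 101)])

def random_row(conn):
    # Step 1: count the rows (the potentially slow part on big tables).
    n = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
    # Step 2: skip a uniformly chosen number of rows in [0, n-1].
    offset = random.randrange(n)
    return conn.execute(
        "SELECT myid FROM mytable LIMIT 1 OFFSET ?", (offset,)
    ).fetchone()[0]

row = random_row(conn)
print(row)
```

Note the count and the offset query are two round trips; on a table that changes between them, the offset can fall past the end, so production code should handle an empty result.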

Update (by Antony Hatchkins)

You must use floor here:

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;

Consider a table of 2 rows: random()*N generates 0 <= x < 2, so for example SELECT myid FROM mytable OFFSET 1.7 LIMIT 1; returns 0 rows, because the offset is implicitly rounded to the nearest int (here 2, which is past the last row).
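The edge case needs no database to see: with N = 2 the only valid offsets are 0 and 1, and a fractional value like 1.7 survives floor but not nearest-int rounding:

```python
import math

n = 2    # rows in the table
x = 1.7  # a possible value of random() * n

assert math.floor(x) == 1  # floor: a valid offset, returns the last row
assert round(x) == 2       # nearest-int rounding: offset == n, past the end, 0 rows
```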

冷了相思 2024-10-28 23:34:15


PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE

The syntax is

SELECT * FROM my_table TABLESAMPLE BERNOULLI(percentage);
SELECT * FROM my_table TABLESAMPLE SYSTEM(percentage);

This is not the optimal solution if you want only one row selected, because you need to know the COUNT of the table to calculate the exact percentage.

To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:

 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.000001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.00001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.0001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.001) LIMIT 1;
 ...

This may not look elegant, but it is probably faster than any of the other answers.

To decide whether you want to use BERNOULLI or SYSTEM, read about the difference at https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/
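The escalating retries above can be wrapped in a small helper. In this sketch, `run_sample` is a hypothetical stand-in for whatever executes `SELECT * FROM my_table TABLESAMPLE SYSTEM(pct) LIMIT 1` against the database, so only the retry logic itself is shown:

```python
def sample_one(run_sample, start_pct=0.000001, max_pct=100.0):
    """Retry TABLESAMPLE with a 10x larger percentage until a row comes back."""
    pct = start_pct
    while pct <= max_pct:
        row = run_sample(pct)  # e.g. TABLESAMPLE SYSTEM(pct) LIMIT 1
        if row is not None:
            return row
        pct *= 10              # no result: the sample was too small, widen it
    return None                # even sampling 100% found nothing: table is empty

# Fake runner standing in for a real database call: yields a row only once
# the sampled fraction is large enough.
fake = lambda pct: ("some-row",) if pct >= 0.001 else None
print(sample_one(fake))
```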

宛菡 2024-10-28 23:34:15


I tried this with a subquery and it worked fine. OFFSET, at least in PostgreSQL v8.4.4, works well:

select * from mytable offset random() * (select count(*) from mytable) limit 1 ;
西瓜 2024-10-28 23:34:15


You need to use floor:

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
浊酒尽余欢 2024-10-28 23:34:15


Check this link out for some different options.
http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

Update: (A.Hatchkins)

The summary of the (very) long article is as follows.

The author lists four approaches:

1) ORDER BY random() LIMIT 1; -- slow

2) ORDER BY id where id>=random()*N LIMIT 1 -- non-uniform if there are gaps

3) random column -- needs to be updated every now and then

4) custom random aggregate -- cunning method, could be slow: random() needs to be generated N times

and suggests improving method #2 by using

5) ORDER BY id where id=random()*N LIMIT 1
with subsequent requeries if the result is empty.
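Method #5 can be sketched against a toy table with gaps in the id sequence (SQLite here, made-up data). Because a miss triggers a fresh draw rather than falling through to a neighbouring row, every existing id stays equally likely:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
# ids 1..20 with gaps, as if some rows had been deleted
ids = [i for i in range(1, 21) if i not in (3, 7, 8, 15)]
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in ids])

def random_row(conn, max_id):
    # Requery until the randomly chosen id actually exists in the table.
    while True:
        r = random.randint(1, max_id)
        row = conn.execute("SELECT id FROM t WHERE id = ?", (r,)).fetchone()
        if row is not None:
            return row[0]

picked = random_row(conn, 20)
print(picked)
```

The expected number of retries grows with the fraction of gaps, so this works best when ids are mostly dense.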

妞丶爷亲个 2024-10-28 23:34:15


The easiest and fastest way to fetch a random row is to use the tsm_system_rows extension:

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

Then you can select the exact number of rows you want:

SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(1);

This is available in PostgreSQL 9.5 and later.

See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html

旧伤慢歌 2024-10-28 23:34:15


I've come up with a very fast solution without TABLESAMPLE. Much faster than OFFSET random()*N LIMIT 1. It doesn't even require the table count.

The idea is to create an expression index on random but predictable data, for example md5(primary key).

Here is a test with 1M rows sample data:

create table randtest (id serial primary key, data int not null);

insert into randtest (data) select (random()*1000000)::int from generate_series(1,1000000);

create index randtest_md5_id_idx on randtest (md5(id::text));

explain analyze
select * from randtest where md5(id::text)>md5(random()::text)
order by md5(id::text) limit 1;

Result:

 Limit  (cost=0.42..0.68 rows=1 width=8) (actual time=6.219..6.220 rows=1 loops=1)
   ->  Index Scan using randtest_md5_id_idx on randtest  (cost=0.42..84040.42 rows=333333 width=8) (actual time=6.217..6.217 rows=1 loops=1)
         Filter: (md5((id)::text) > md5((random())::text))
         Rows Removed by Filter: 1831
 Total runtime: 6.245 ms

This query can sometimes (with about 1/Number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also, the probabilities aren't exactly equal - some rows are more probable than others.
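The check-and-rerun can be folded into a small wrapper. The sketch below rebuilds the same scheme in SQLite with made-up data; since SQLite has no built-in md5(), one is registered from Python (marked deterministic so it can back the expression index):

```python
import hashlib
import random
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite lacks md5(); register one so the expression index and query work.
conn.create_function("md5", 1,
                     lambda s: hashlib.md5(str(s).encode()).hexdigest(),
                     deterministic=True)
conn.execute("CREATE TABLE randtest (id INTEGER PRIMARY KEY, data INT NOT NULL)")
conn.executemany("INSERT INTO randtest VALUES (?, ?)",
                 [(i, random.randrange(1000000)) for i in range(1, 1001)])
conn.execute("CREATE INDEX randtest_md5_id_idx ON randtest (md5(id))")

def random_row(conn):
    # Rerun on the rare miss (pivot hash beyond the largest md5(id)).
    while True:
        row = conn.execute(
            "SELECT id, data FROM randtest "
            "WHERE md5(id) > md5(?) ORDER BY md5(id) LIMIT 1",
            (random.random(),)).fetchone()
        if row is not None:
            return row

print(random_row(conn))
```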

For comparison:

explain analyze SELECT id FROM randtest OFFSET random()*1000000 LIMIT 1;

Results vary widely, but can be pretty bad:

 Limit  (cost=1442.50..1442.51 rows=1 width=4) (actual time=179.183..179.184 rows=1 loops=1)
   ->  Seq Scan on randtest  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.016..134.835 rows=915702 loops=1)
 Total runtime: 179.211 ms
(3 rows)
终遇你 2024-10-28 23:34:15


I added a column holding a randomly generated number to each row; the value is generated in my programming language when the row is inserted.
When calling, I pass a random number to the query (in this case 0.27):

SELECT * FROM
(
  (SELECT id, random FROM t where <condition> and random >= 0.27 ORDER BY random LIMIT 1)
  UNION ALL
  (SELECT id, random FROM t where <condition> and random < 0.27 ORDER BY random DESC LIMIT 1)
) as results
ORDER BY abs(0.27-random) LIMIT 1;

(Query taken from here)

If you have an index on the columns in your condition plus the random column (the one containing the random numbers), I get a result in 6 ms on my 8.5-million-row table. This is orders of magnitude faster than using anything like ORDER BY random().

To improve randomness, you can also generate a new random number for each row you hit. (Without this, some rows will occur more often than others.)

Unlike TABLESAMPLE this also supports conditions.
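A self-contained sketch of the same two-sided lookup (SQLite here, made-up table; the column is renamed rnd to avoid clashing with the random() function, and each compound member is wrapped in a subselect because SQLite disallows ORDER BY/LIMIT directly inside UNION ALL members). The value 0.27 plays the role of the number passed in by the application:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, rnd REAL NOT NULL)")
# Pre-assigned random column, filled at insert time by the application.
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, random.random()) for i in range(1, 1001)])
conn.execute("CREATE INDEX t_rnd_idx ON t (rnd)")  # makes both halves index scans

p = 0.27  # random number generated client-side at call time
query = """
SELECT id, rnd FROM (
    SELECT * FROM (SELECT id, rnd FROM t WHERE rnd >= :p ORDER BY rnd ASC  LIMIT 1)
    UNION ALL
    SELECT * FROM (SELECT id, rnd FROM t WHERE rnd <  :p ORDER BY rnd DESC LIMIT 1)
) ORDER BY abs(:p - rnd) LIMIT 1
"""
picked = conn.execute(query, {"p": p}).fetchone()
print(picked)
```

The UNION ALL takes the nearest neighbour on each side of p, and the outer ORDER BY keeps whichever is closer, so a row is returned even when p falls outside the stored range.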
