postgresql: OFFSET + LIMIT gets to be very slow


I have a table tmp_drop_ids with one column, id, and 3.3 million entries. I want to iterate over the table, doing something with every 200 entries. I have this code:

LIMIT = 200
for offset in xrange(0, drop_count+LIMIT, LIMIT):
    print "Making tmp table with ids %s to %s/%s" % (offset, offset+LIMIT, drop_count)
    query = """DROP TABLE IF EXISTS tmp_cur_drop_ids; CREATE TABLE tmp_cur_drop_ids AS
    SELECT id FROM tmp_drop_ids ORDER BY id OFFSET %s LIMIT %s;""" % (offset, LIMIT)
    cursor.execute(query)

This runs fine at first (~0.15s to generate the tmp table), but it slows down intermittently; e.g., around 300k tickets it started taking 11-12 seconds to generate this tmp table, and again around 400k. It basically seems unreliable.

I will use those ids in other queries, so I figured the best place to have them was in a tmp table. Is there any better way to iterate through results like this?


Comments (2)

百善笑为先 2024-12-20 05:29:34


Use a cursor instead. Using OFFSET and LIMIT is pretty expensive, because Postgres has to execute the query, then process and skip OFFSET rows before returning anything. OFFSET means "skip rows", so the cost grows with the offset: the deeper into the table you page, the more rows each query has to discard, which is why your batches get slower over time.

A cursor allows iteration over one query; see the PostgreSQL cursor documentation for details.

BEGIN;  -- a cursor only lives inside a transaction
DECLARE C CURSOR FOR SELECT * FROM big_table;
FETCH 300 FROM C; -- get the first 300 rows
FETCH 300 FROM C; -- get the next 300 rows
...
COMMIT; -- closes the cursor

You can probably use a server-side cursor without writing an explicit DECLARE statement, since psycopg supports them directly (see the psycopg documentation section about server-side cursors).
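For illustration, a minimal sketch of that approach with psycopg2; the connection string and cursor name below are made up, while the table and batch size come from the question:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

# Passing a name to cursor() makes psycopg2 DECLARE a server-side cursor,
# so rows stay on the server and are pulled over in batches.
with conn:
    with conn.cursor(name="drop_ids_cursor") as cur:
        cur.execute("SELECT id FROM tmp_drop_ids ORDER BY id")
        while True:
            batch = cur.fetchmany(200)  # one FETCH 200 round trip
            if not batch:
                break
            ids = [row[0] for row in batch]
            # ... do something with these 200 ids ...

Everything happens inside one transaction here, which is what keeps the cursor alive; each fetchmany() is a cheap incremental read rather than a re-sort and re-skip of the whole table.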

余厌 2024-12-20 05:29:34


If your id column is indexed, you can use "limit" together with "id >" (keyset pagination), for example in Python-like pseudocode:

limit = 200
max_processed_id = -1
query("create table tmp_cur_drop_ids(id int)")
while true:
  query("truncate tmp_cur_drop_ids")
  # fetch the next batch: ids strictly greater than the last one processed
  query("insert into tmp_cur_drop_ids(id)" \
        + " select id from tmp_drop_ids" \
        + " where id > %d order by id limit %d" % (max_processed_id, limit))
  max_processed_id = query("select max(id) from tmp_cur_drop_ids")
  if max_processed_id is None:  # table exhausted
    break
  process_tmp_cur_drop_ids()
query("drop table tmp_cur_drop_ids")

This way Postgres can use the index on id for every batch, so each query stays equally fast no matter how far into the table you are.
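A rough translation of that pseudocode into runnable Python with psycopg2 might look like the following; the connection string is a placeholder, and the commented step stands in for whatever work is done per 200 ids:

import psycopg2

LIMIT = 200

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

cur.execute("CREATE TABLE tmp_cur_drop_ids (id int)")
max_processed_id = -1
while True:
    cur.execute("TRUNCATE tmp_cur_drop_ids")
    # Keyset pagination: the index on id serves both the WHERE and the
    # ORDER BY, so this costs the same for the last batch as for the first.
    cur.execute(
        "INSERT INTO tmp_cur_drop_ids (id) "
        "SELECT id FROM tmp_drop_ids "
        "WHERE id > %s ORDER BY id LIMIT %s",
        (max_processed_id, LIMIT),
    )
    cur.execute("SELECT max(id) FROM tmp_cur_drop_ids")
    max_processed_id = cur.fetchone()[0]
    if max_processed_id is None:  # no rows left
        break
    # ... use tmp_cur_drop_ids in other queries here ...
    conn.commit()
cur.execute("DROP TABLE tmp_cur_drop_ids")
conn.commit()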
