使用 Django QuerySet 分块处理数据库的最佳方法?
我正在对数据库中的所有行运行批处理操作。这涉及选择每个模型并对其进行处理。将其分成多个块并逐块执行是有意义的。
我目前正在使用Paginator,因为它很方便。这意味着我需要对值进行排序,以便可以按顺序对它们进行分页。这确实会生成具有 order
和 limit
子句的 SQL 语句,并且对于每个块,我认为 Postgres 可能会对整个表进行排序(尽管我不能声称拥有任何知识)关于内部结构)。我所知道的是数据库的 CPU 利用率约为 50%,我认为对于执行 select
来说这个值太高了。
以 RDMBS/CPU 友好的方式迭代整个表的最佳方法是什么?
假设在批处理操作期间数据库的内容没有改变。
I am running a batch operation over all rows in a database. This involves selecting every single model and doing something to it. It makes sense to split this into chunks and do it chunk by chunk.
I'm currently using Paginator, because it's convenient. This means that I need an ordering on the values so that they can be paged through in order. This does generate SQL statements which have order
and limit
clauses, and for each chunk I think Postgres may be sorting the whole table (although I can't claim to have any knowledge about internals). All I know is that the database is at around 50% CPU and I think that's too high just to be doing select
s.
What's the best way to iterate over the entire table in a RDMBS/CPU-friendly way?
Assuming that the contents of the database doesn't change during the batch operation.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
根据您的描述,您实际上并不关心您处理的行的排序顺序。如果表中有主键(我希望如此!),这种粗略的分区方法会快得多:
这对于任何偏移量执行相同的操作,并且对于任何大小的表(几乎)执行相同的操作。
相应地检索主键和分区的最小值和最大值:
与此相反:
这通常会更慢,因为必须对所有行进行排序,并且随着偏移量更高和表更大,性能还会降低。
From your description you don't actually care about the sort order of the rows you process. If you have primary keys in your tables (which I expect!), this crude method of partitioning would be much faster:
This performs the same for any offset and (almost) the same for any size of table.
Retrieve min and max of your primary key and partition accordingly:
As opposed to:
This is generally slower because all rows have to be sorted and performance degrades additionally with higher offsets and bigger tables.
以下代码针对 Django 查询集实现了 Erwin 的上述答案(使用 BETWEEN):
将为任意 Django 查询集执行此操作的实用程序函数如下。它默认假设“id”是用于
Between 子句的合适字段。
它的使用方式如下:
您还可以更改接口,以便不需要额外的嵌套循环,但可以执行
for item in chunked_queryset(qs)
:The following code implements Erwin's answer above (using
BETWEEN
) for a Django QuerySet:A utility function that will do this for an arbitrary Django QuerySet is as follows. It defaults to assuming 'id' is a suitable field to use for the
between
clause.It would be used like this:
You could also change the interface so that you didn't need the extra nested loop, but could do
for item in chunked_queryset(qs)
: