Why is iterating through a large Django QuerySet consuming massive amounts of memory?
The table in question contains roughly ten million rows.
for event in Event.objects.all():
print event
This causes memory usage to increase steadily to 4 GB or so, at which point the rows print rapidly. The lengthy delay before the first row printed surprised me – I expected it to print almost instantly.
I also tried Event.objects.iterator(), which behaved the same way.
I don't understand what Django is loading into memory or why it is doing this. I expected Django to iterate through the results at the database level, which would mean the results would be printed at a roughly constant rate (rather than all at once after a lengthy wait).
What have I misunderstood?
(I don't know whether it's relevant, but I'm using PostgreSQL.)
10 Answers
Nate C was close, but not quite.
From the docs:

    Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it.
So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.
From my reading of the docs, iterator() does nothing more than bypass the QuerySet's internal caching mechanisms. I think it might make sense for it to fetch rows one at a time, but that would conversely require ten million individual hits on your database. Maybe not all that desirable. Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes; one such pattern is sketched below.
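One pattern that shows up in such snippets, sketched here from memory rather than copied from any particular one (the name queryset_iterator and the chunk size are illustrative, and it assumes an auto-incrementing integer primary key):

    import gc

    def queryset_iterator(queryset, chunksize=1000):
        # Walk the queryset in primary-key order, pulling rows in
        # chunks so only ~chunksize objects are in memory at a time.
        last_pk = queryset.order_by('-pk').values_list('pk', flat=True).first()
        if last_pk is None:
            return  # empty queryset
        pk = 0
        queryset = queryset.order_by('pk')
        while pk < last_pk:
            for row in queryset.filter(pk__gt=pk)[:chunksize]:
                pk = row.pk
                yield row
            gc.collect()  # let the previous chunk's objects be freed

    for event in queryset_iterator(Event.objects.all()):
        print(event)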
Might not be the fastest or most efficient, but as a ready-made solution why not use Django core's Paginator and Page objects, documented here:
https://docs.djangoproject.com/en/dev/topics/pagination/
Something like this:
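A rough sketch of that approach (Event is the model from the question; the page size of 1000 is arbitrary). Each paginator.page() call issues its own LIMIT/OFFSET query, so only one page of objects is in memory at a time:

    from django.core.paginator import Paginator

    paginator = Paginator(Event.objects.order_by('pk'), 1000)
    for page_number in paginator.page_range:
        for event in paginator.page(page_number).object_list:
            print(event)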
Django's default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet's iterator method to avoid this caching:
https://docs.djangoproject.com/en/stable/ref/models/querysets/#iterator
The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.
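For example, the question's loop with iterator() added:

    for event in Event.objects.all().iterator():
        print(event)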
Using iterator() reduces memory usage for me, but it is still higher than I expected. Using the paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case.
For large amounts of records, a database cursor performs even better. You do need raw SQL in Django; the Django cursor is something different from a SQL cursor.
The LIMIT - OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor because it has to run the same query over and over again and has to jump over more and more results.
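A minimal sketch of the raw-SQL route using psycopg2 directly (the asker is on PostgreSQL); the connection string, cursor name, and table name are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=myuser")
    cur = conn.cursor(name='event_cursor')  # naming the cursor makes it server-side
    cur.itersize = 2000  # rows fetched per network round trip
    cur.execute("SELECT * FROM myapp_event")
    for row in cur:
        print(row)
    cur.close()
    conn.close()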
Django doesn't have a good solution for fetching large numbers of items from the database.

values_list can be used to fetch all the ids from the database, and then each object can be fetched separately. Over time, large objects pile up in memory and won't be garbage collected until the for loop exits. The code below does manual garbage collection after every 100th item is consumed.
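The original snippet didn't survive here; a reconstruction along the lines described might look like this (Event is the model from the question):

    import gc

    # Only the ids are loaded up front: a flat sequence of integers,
    # far smaller than ten million model instances.
    event_ids = Event.objects.values_list('id', flat=True)

    for index, event_id in enumerate(event_ids):
        event = Event.objects.get(id=event_id)  # one query per object
        # ... do the necessary work with event ...
        if index % 100 == 0:
            gc.collect()  # reclaim objects from earlier iterations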
There are a lot of outdated results here. Not sure when it was added, but Django's QuerySet.iterator() method uses a server-side cursor with a chunk size to stream results from the database. So if you're using Postgres, this should now be handled out of the box for you.
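For example (chunk_size is the documented parameter; 2000 is Django's default):

    for event in Event.objects.all().iterator(chunk_size=2000):
        print(event)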
http://docs.djangoproject.com/en/dev/ref/models/查询集/
因此,当运行
print 事件
时,查询会触发(根据您的命令,这是一次全表扫描。)并加载结果。您要求所有对象,并且如果不获得所有对象就无法获得第一个对象。但如果你这样做:
http://docs.djangoproject .com/en/dev/topics/db/queries/#limiting-querysets
然后它会在内部为sql添加偏移量和限制。
This is from the docs:
http://docs.djangoproject.com/en/dev/ref/models/querysets/

    No database activity actually occurs until you do something to evaluate the queryset.
So when print event is run, the query fires (which is a full table scan, given your command) and loads the results. You're asking for all the objects, and there is no way to get the first object without getting all of them. But if you do something like this:
http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
Then it will add offsets and limits to the SQL internally.
A massive amount of memory gets consumed before the queryset can be iterated, because all the database rows for the whole query are processed into objects at once, and that can be a lot of processing depending on the number of rows.
You can chunk up your queryset into smaller, digestible bits. I call this pattern "spoonfeeding". Here's an implementation with a progress bar that I use in my management commands; first:
pip3 install tqdm
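The snippet itself didn't survive here; the following is a reconstruction of its general shape (the name spoonfeed and its defaults are illustrative, and the progress total assumes roughly dense integer pks):

    from tqdm import tqdm

    def spoonfeed(qs, func, chunk=1000, start=0):
        # Apply func to every object in qs, fetching rows in
        # pk-ordered chunks and reporting progress as pks are passed.
        end = qs.order_by('pk').values_list('pk', flat=True).last()
        if end is None:
            return  # nothing to do for an empty queryset
        progressbar = tqdm(total=end, initial=start)
        while start < end:
            for obj in qs.filter(pk__gt=start, pk__lte=start + chunk):
                func(obj)
            start += chunk
            progressbar.update(chunk)
        progressbar.close()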
To use this you write a function that does operations on your object:
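For instance (process_event is a made-up placeholder):

    def process_event(event):
        # stand-in for whatever work each row needs
        print(event)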
and then run that function on your queryset:
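For example, with the sketch above:

    spoonfeed(Event.objects.all(), process_event)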
Here's a solution that includes len and count:
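The code didn't survive extraction; a sketch of what such a wrapper could look like (the class name is illustrative, and chunk_size on iterator() needs Django 2.0+):

    class LenQuerySet:
        # Iteration streams rows via iterator(), while len() and
        # count() stay cheap COUNT(*) queries instead of fetching rows.
        def __init__(self, queryset, chunk_size=1000):
            self.queryset = queryset
            self.chunk_size = chunk_size

        def __iter__(self):
            return self.queryset.iterator(chunk_size=self.chunk_size)

        def __len__(self):
            return self.queryset.count()

        def count(self):
            return self.queryset.count()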
Usage:
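Continuing the sketch above:

    events = LenQuerySet(Event.objects.all())
    print(len(events))    # issues SELECT COUNT(*)
    for event in events:  # streams rows instead of caching them
        print(event)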
I usually use a raw MySQL query instead of the Django ORM for this kind of task.

MySQL supports a streaming mode, so we can loop through all the records safely and quickly without running out of memory.
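A minimal sketch with MySQLdb (the credentials and table name are placeholders); SSCursor is the library's unbuffered cursor, which fetches rows from the server as you iterate:

    import MySQLdb
    import MySQLdb.cursors

    conn = MySQLdb.connect(host='127.0.0.1', user='user', passwd='secret',
                           db='mydb', cursorclass=MySQLdb.cursors.SSCursor)
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM myapp_event')
    for row in cursor:  # rows stream one at a time; memory stays flat
        print(row)
    cursor.close()
    conn.close()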
Ref: