Limiting Memory Use in a *Large* Django QuerySet
I have a task which needs to be run on 'most' objects in my database once every some period of time (once a day, once a week, whatever). Basically this means that I have some query that looks like this running in its own thread:
    for model_instance in SomeModel.objects.all():
        do_something(model_instance)
(Note that it's actually a filter() rather than all(), but nonetheless I still end up selecting a very large set of objects.)
The problem I'm running into is that after running for a while, the thread is killed by my hosting provider because I'm using too much memory. I'm assuming all this memory use happens because, even though the QuerySet object returned by my query initially has a very small memory footprint, it ends up growing as the QuerySet object caches each model_instance as I iterate through them.
My question is, "what is the best way to iterate through almost every SomeModel in my database in a memory-efficient way?" or perhaps my question is "how do I 'un-cache' model instances from a django queryset?"
EDIT: I'm actually using the results of the queryset to build a series of new objects. As such, I don't end up updating the queried-for objects at all.
What about using django core's Paginator and Page objects documented here:
https://docs.djangoproject.com/en/dev/topics/pagination/
Something like this:
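(a sketch; the helper name and page size are illustrative, not from the original answer)

    from django.core.paginator import Paginator

    def paginated_iterator(queryset, page_size=1000):
        # Page through the queryset so that only one page of objects
        # is held in memory at a time.
        paginator = Paginator(queryset.order_by('pk'), page_size)
        for page_number in paginator.page_range:
            for obj in paginator.page(page_number).object_list:
                yield obj

    for model_instance in paginated_iterator(SomeModel.objects.all()):
        do_something(model_instance)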
So what I actually ended up doing is building something that you can 'wrap' a QuerySet in. It works by making a deepcopy of the QuerySet, using the slice syntax--e.g.,
some_queryset[15:45]
--but then it makes another deepcopy of the original QuerySet when the slice has been completely iterated through. This means that only the set of objects returned in 'this' particular slice is stored in memory. So instead of...
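(the pattern from the question)

    # The queryset caches every instance it returns, so memory grows
    # with the size of the table.
    for model_instance in SomeModel.objects.all():
        do_something(model_instance)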
You would do...
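A sketch of the wrapper described above (a reconstruction, not the author's original code; the class name and default chunk size are illustrative):

    import copy

    class MemorySavingQuerysetIterator:
        def __init__(self, queryset, max_obj_num=1000):
            self._base_queryset = queryset
            self._max_obj_num = max_obj_num

        def __iter__(self):
            total = self._base_queryset.count()
            for start in range(0, total, self._max_obj_num):
                # Deep-copy the base queryset for every slice so that no
                # result cache outlives the slice it belongs to.
                chunk = copy.deepcopy(self._base_queryset)
                for obj in chunk[start:start + self._max_obj_num]:
                    yield obj

    for model_instance in MemorySavingQuerysetIterator(SomeModel.objects.all()):
        do_something(model_instance)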
Please note that the intention of this is to save memory in your Python interpreter. It essentially does this by making more database queries. Usually people are trying to do the exact opposite of that--i.e., minimize database queries as much as possible without regard to memory usage. Hopefully somebody will find this useful though.
You can't use Model.objects.all().iterator(), because it will fetch all the elements of your table at once. Neither can you use Model.objects.all()[offset:offset+pagesize], because it will cache the results. Either will exceed your memory limit.

I've tried to mix both solutions, and it worked:
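A sketch of the mixed approach (do_something_with is a placeholder):

    count = Model.objects.all().count()
    pagesize = 1000
    offset = 0
    while offset < count:
        # The slice caps how many rows are fetched; .iterator() stops
        # the sliced queryset from caching them.
        for m in Model.objects.all()[offset : offset + pagesize].iterator():
            do_something_with(m)
        offset += pagesize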
Change
pagesize
to fit your requirements, and optionally change the[offset : offset + pagesize]
to the[offset * pagesize : (offset + 1) * pagesize]
idiom if it fits you better. Also, of course, replaceModel
by your actual model name.许多解决方案通过对查询集进行切片来实现 SQL
OFFSET
和LIMIT
。正如斯特凡诺指出的那样,对于更大的数据集,这变得非常低效。处理此问题的正确方法是使用服务器端光标来跟踪 OFFSET。本机服务器端光标支持正在为 django 开发。在准备就绪之前,如果您将 postgres 与 psycopg2 后端一起使用,这里有一个简单的实现:
请参阅 这篇博文 很好地解释了 django 中大型查询的内存问题。
Many solutions implement SQL OFFSET and LIMIT via slicing the queryset. As Stefano notes, with larger datasets this becomes very inefficient. The proper way of handling this is to use server-side cursors to keep track of the OFFSET.

Native server-side cursor support is in the works for django. Until it's ready, here is a simple implementation if you are using postgres with the psycopg2 backend:
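A sketch of what such an implementation might look like (the function name and chunk size are illustrative; it assumes the default database connection is PostgreSQL via psycopg2 and yields raw tuples, not model instances):

    from django.db import connection, transaction

    def server_side_cursor_rows(model, chunk_size=2000):
        # The table name comes from model metadata, not user input.
        table = model._meta.db_table
        with transaction.atomic():
            # A *named* psycopg2 cursor is a server-side cursor.
            with connection.connection.cursor(name='large_table_cursor') as cursor:
                cursor.itersize = chunk_size  # rows fetched per round trip
                cursor.execute('SELECT * FROM {}'.format(table))
                for row in cursor:
                    yield row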
See this blog post for a great explanation of memory issues from large queries in django.
Short Answer
If you are using PostgreSQL or Oracle, you can use Django's built-in iterator:
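For example (the model name, callback, and chunk size are placeholders):

    for obj in MyModel.objects.all().iterator(chunk_size=2000):
        do_something(obj)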
This causes Django to use server-side cursors and not cache models as it iterates through the queryset. As of Django 4.1, this will even work with prefetch_related.

For other databases, you can use the following:
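A sketch of a pk-based pager (the name and defaults are illustrative; it assumes the primary key is in the results):

    def queryset_iterator(queryset, page_size=2000):
        queryset = queryset.order_by('pk')
        last_pk = None
        while True:
            page = queryset
            if last_pk is not None:
                # Resume after the last pk seen instead of using OFFSET.
                page = page.filter(pk__gt=last_pk)
            page = list(page[:page_size])
            if not page:
                return
            yield from page
            last_pk = page[-1].pk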
If you want to get back pages rather than individual objects to combine with other optimizations such as bulk_update, use this:
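(the same pk-chasing idea, sketched to yield whole pages instead of objects)

    def queryset_to_pages(queryset, page_size=2000):
        queryset = queryset.order_by('pk')
        last_pk = None
        while True:
            page = queryset
            if last_pk is not None:
                page = page.filter(pk__gt=last_pk)
            page = list(page[:page_size])
            if not page:
                return
            yield page
            last_pk = page[-1].pk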
Performance Profiling on PostgreSQL

I profiled a number of different approaches on a PostgreSQL table with about 200,000 rows on Django 3.2 and Postgres 13. For every query, I added up the sum of the ids, both to ensure that Django was actually retrieving the objects and so that I could verify correctness of iteration between queries. All of the timings were taken after several iterations over the table in question to minimize caching advantages of later tests.
Basic Iteration
The basic approach is just iterating over the table. The main issue with this approach is that the amount of memory used is not constant; it grows with the size of the table, and I've seen this run out of memory on larger tables.
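In sketch form (matching the id-summing setup described above):

    total = 0
    for obj in MyModel.objects.all():
        # Every instance stays in the queryset's result cache,
        # so memory grows with the table.
        total += obj.id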
Django Iterator
The Django iterator (at least as of Django 3.2) fixes the memory issue with minor performance benefit. Presumably this comes from Django spending less time managing cache.
Custom Iterator
The natural comparison point is attempting to do the paging ourselves by issuing successive queries that advance along the primary key. While this is an improvement over naive iteration in that it has constant memory, it actually loses to Django's built-in iterator on speed because it makes more database queries.
Custom Paging Function
The main reason to use the custom iteration is so that you can get the results in pages. This function is very useful to then plug in to bulk-updates while only using constant memory. It's a bit slower than queryset_iterator in my tests and I don't have a coherent theory as to why, but the slowdown isn't substantial.
Alternative Custom Paging Function
Given that Django's queryset iterator is faster than doing the paging ourselves, the queryset pager can alternatively be implemented to use it. It's a little bit faster than doing the paging ourselves, but the implementation is messier. Readability matters, which is why my personal preference is the previous paging function, but this one can be better if your queryset doesn't have a primary key in the results (for whatever reason).
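A sketch of that alternative (again with illustrative names):

    def queryset_to_pages_via_iterator(queryset, page_size=2000):
        page = []
        for obj in queryset.iterator(chunk_size=page_size):
            page.append(obj)
            if len(page) == page_size:
                yield page
                page = []
        if page:  # emit the final, possibly short, page
            yield page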
Bad Approaches
The following are approaches you should never use (many of which are suggested in the question) along with why.
Do NOT Use Slicing on an Unordered Queryset
Whatever you do, do NOT slice an unordered queryset. This does not correctly iterate over the table. The reason for this is that the slice operation does a SQL limit + offset query based on your queryset, and django querysets have no order guarantee unless you use order_by. Additionally, PostgreSQL does not have a default order by, and the Postgres docs specifically warn against using limit + offset without order by. As a result, each time you take a slice, you are getting a non-deterministic slice of your table, which means your slices may overlap and may not cover all rows of the table between them. In my experience, this only happens if something else is modifying data in the table while you are doing the iteration, which only makes this problem more pernicious, because it means the bug might not show up if you are testing your code in isolation.

Do NOT use Slicing for Whole-Table Iteration in General
Even if we order the queryset, list slicing is abysmal from a performance perspective. This is because SQL offset is a linear time operation, which means that a limit + offset paged iteration of a table will be quadratic time, which you absolutely do not want.
Do NOT use Django's Paginator for Whole-Table Iteration
Django comes with a built-in Paginator. It may be tempting to think that this is appropriate for doing a paged iteration of a database, but it is not. The point of Paginator is to return a single page of a result to a UI or an API endpoint. It is substantially slower than any of the good approaches at iterating over a table.
There is a django snippet for this:
http://djangosnippets.org/snippets/1949/
It iterates over a queryset by yielding rows of smaller "chunks" of the original queryset. It ends up using significantly less memory while allowing you to tune for speed. I use it in one of my projects.
I'm continuing research and it kind of looks like I want to do the equivalent of an SQL OFFSET and LIMIT, which according to the Django docs on Limiting QuerySets means I want to use the slice syntax, e.g.,
SomeModel.objects.all()[15:25]
So now I'm thinking maybe something like this is what I'm looking for:
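Perhaps along these lines (a sketch; the 1000-row chunk size is arbitrary):

    queryset = SomeModel.objects.all()
    count = queryset.count()
    counter = 0
    while counter < count:
        # Each pass slices off a fresh, small queryset; only its
        # result cache is in memory at any one time.
        smaller_queryset = queryset[counter:counter + 1000]
        for model_instance in smaller_queryset:
            do_something(model_instance)
        counter += 1000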
By my reckoning this would make it so that smaller_queryset would never grow too large.
The following approach doesn't use an expensive database offset query and avoids calculating the page number, making it more efficient.
Limitations specified in the docstring.
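A sketch of such an approach (a reconstruction; the name and default batch size are illustrative):

    def queryset_in_batches(queryset, batch_size=1000):
        """
        Iterate a queryset in pk order without OFFSET queries.

        Limitations:
        - the model must have an orderable primary key;
        - rows inserted behind the iteration cursor are skipped;
        - any ordering already on the queryset is overridden.
        """
        last_pk = None
        while True:
            batch = queryset.order_by('pk')
            if last_pk is not None:
                batch = batch.filter(pk__gt=last_pk)
            batch = list(batch[:batch_size])
            if not batch:
                break
            yield from batch
            last_pk = batch[-1].pk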