Limiting memory use in a *large* Django QuerySet

I have a task which needs to be run on 'most' objects in my database once every so often (once a day, once a week, whatever). Basically this means that I have some query that looks like this running in its own thread.

for model_instance in SomeModel.objects.all():
    do_something(model_instance)

(Note that it's actually a filter() not all(), but nonetheless I still end up selecting a very large set of objects.)

The problem I'm running into is that after running for a while, the thread is killed by my hosting provider because I'm using too much memory. I'm assuming all this memory use happens because even though the QuerySet object returned by my query initially has a very small memory footprint, it ends up growing as the QuerySet caches each model_instance while I iterate through them.

My question is, "what is the best way to iterate through almost every SomeModel in my database in a memory efficient way?" or perhaps my question is "how do I 'un-cache' model instances from a django queryset?"

EDIT: I'm actually using the results of the queryset to build a series of new objects. As such, I don't end up updating the queried-for objects at all.
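
For concreteness, a minimal sketch of that kind of job, assuming a hypothetical NewModel (with made-up source/result fields) for the derived objects and Django 2.0+ for iterator(chunk_size=...); the memory-friendly batching is what the answers below address:

batch = []
for model_instance in SomeModel.objects.all().iterator(chunk_size=1000):
    # NewModel, source and result are placeholders for whatever is being built
    batch.append(NewModel(source=model_instance, result=do_something(model_instance)))
    if len(batch) >= 1000:
        NewModel.objects.bulk_create(batch)
        batch = []
if batch:
    NewModel.objects.bulk_create(batch)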


Answers (8)

别再吹冷风 2024-10-22 02:33:02

What about using django core's Paginator and Page objects documented here:

https://docs.djangoproject.com/en/dev/topics/pagination/

Something like this:

from django.core.paginator import Paginator
from djangoapp.models import SomeModel

queryset = SomeModel.objects.all().order_by("pk")  # sorting is necessary for correctness
paginator = Paginator(queryset, 1000)  # chunks of 1000

for page_idx in paginator.page_range:  # 1-based, and includes the last page
    for row in paginator.page(page_idx).object_list:
        do_something(row)  # here you can do what you want with the row
    print("done processing page %s" % page_idx)
甜尕妞 2024-10-22 02:33:02

So what I actually ended up doing is building something that you can 'wrap' a QuerySet in. It works by making a deepcopy of the QuerySet and slicing it--e.g., some_queryset[15:45]--and then making another deepcopy of the original QuerySet once the slice has been completely iterated through. This means that only the set of objects returned in 'this' particular slice is stored in memory.

import copy
import logging

logger = logging.getLogger(__name__)

class MemorySavingQuerysetIterator(object):

    def __init__(self, queryset, max_obj_num=1000):
        self._base_queryset = queryset
        self.max_obj_num = max_obj_num
        self._generator = self._setup()

    def _setup(self):
        for i in range(0, self._base_queryset.count(), self.max_obj_num):
            # By making a copy of the queryset and using that to actually access
            # the objects we ensure that there are only `max_obj_num` objects in
            # memory at any given time
            smaller_queryset = copy.deepcopy(self._base_queryset)[i:i + self.max_obj_num]
            logger.debug('Grabbing next %s objects from DB', self.max_obj_num)
            for obj in smaller_queryset.iterator():
                yield obj

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._generator)

So instead of...

for obj in SomeObject.objects.filter(foo='bar'):  # something that returns *a lot* of objects
    do_something(obj)

You would do...

for obj in MemorySavingQuerysetIterator(SomeObject.objects.filter(foo='bar')):
    do_something(obj)

Please note that the intention of this is to save memory in your Python interpreter. It essentially does this by making more database queries. Usually people are trying to do the exact opposite of that--i.e., minimize database queries as much as possible without regard to memory usage. Hopefully somebody will find this useful though.

梦回旧景 2024-10-22 02:33:02

You can't use Model.objects.all().iterator() because it will fetch all the elements on your table at once. Neither can you use Model.objects.all()[offset:offset+pagesize], because it will cache the results. Either will exceed your memory limit.

I've tried to mix both solutions, and it worked:

offset = 0
pagesize = 1000
count = Model.objects.all().count()
while offset < count:
    for m in Model.objects.all()[offset:offset + pagesize].iterator():
        do_something(m)
    offset += pagesize

Change pagesize to fit your requirements, and optionally change the [offset : offset + pagesize] slicing to the [offset * pagesize : (offset + 1) * pagesize] idiom if it fits you better, as sketched below. Also, of course, replace Model with your actual model name.
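
If you prefer the page-counter idiom mentioned above, a sketch of the same loop (with the same placeholder Model and do_something) would look like this:

page = 0
pagesize = 1000
count = Model.objects.all().count()
while page * pagesize < count:
    for m in Model.objects.all()[page * pagesize:(page + 1) * pagesize].iterator():
        do_something(m)
    page += 1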

情绪 2024-10-22 02:33:02

Many solutions implement SQL OFFSET and LIMIT by slicing the queryset. As stefano notes, with larger datasets this becomes very inefficient. The proper way of handling this is to use server-side cursors to keep track of the OFFSET.

Native server-side cursor support is in the works for django. Until it's ready, here is a simple implementation if you are using postgres with the psycopg2 backend:

from django.db import connection, transaction

def server_cursor_query(Table):
    table_name = Table._meta.db_table

    # There must be an existing connection before creating a server-side cursor
    if connection.connection is None:
        dummy_cursor = connection.cursor()  # not a server-side cursor

    # Keep track of the columns so that we can rebuild model instances. However,
    # if your table has foreign keys, you may need to rename them appropriately
    columns = [x.name for x in Table._meta.local_fields]

    # Giving the cursor a name makes psycopg2 create a server-side cursor
    cursor = connection.connection.cursor(name='gigantic_cursor')

    with transaction.atomic():
        cursor.execute('SELECT {} FROM {}'.format(', '.join(columns), table_name))

        while True:
            rows = cursor.fetchmany(1000)

            if not rows:
                break

            for row in rows:
                fields = dict(zip(columns, row))
                yield Table(**fields)
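
For reference, a usage sketch with the SomeModel and do_something placeholders from the question:

for instance in server_cursor_query(SomeModel):
    do_something(instance)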

See this blog post for a great explanation of memory issues from large queries in django.

合久必婚 2024-10-22 02:33:02

Short Answer

If you are using PostgreSQL or Oracle, you can use Django's built-in iterator:

queryset.iterator(chunk_size=1000)

This causes Django to use server-side cursors and not cache models as it iterates through the queryset. As of Django 4.1, this will even work with prefetch_related.
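
For example, a minimal sketch assuming Django 4.1+ and a hypothetical Book model with an authors relation (the prefetch is performed once per chunk):

for book in Book.objects.prefetch_related("authors").iterator(chunk_size=1000):
    do_something(book)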

For other databases, you can use the following:

def queryset_iterator(queryset, page_size=1000):
    page = queryset.order_by("pk")[:page_size]
    while page:
        for obj in page:
            yield obj
            pk = obj.pk
        page = queryset.filter(pk__gt=pk).order_by("pk")[:page_size]

If you want to get back pages rather than individual objects to combine with other optimizations such as bulk_update, use this:

def queryset_to_pages(queryset, page_size=1000):
    page = queryset.order_by("pk")[:page_size]
    while page:
        yield page
        pk = max(obj.pk for obj in page)
        page = queryset.filter(pk__gt=pk).order_by("pk")[:page_size]
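
As an illustration, a usage sketch combining queryset_to_pages with bulk_update (MyModel is assumed here to have name and name_length fields):

for page in queryset_to_pages(MyModel.objects.all()):
    objs = list(page)  # evaluate the page once
    for obj in objs:
        obj.name_length = len(obj.name)
    MyModel.objects.bulk_update(objs, ["name_length"])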

Performance Profiling on PostgreSQL

I profiled a number of different approaches on a PostgreSQL table with about 200,000 rows on Django 3.2 and Postgres 13. For every query, I added up the sum of the ids, both to ensure that Django was actually retrieving the objects and so that I could verify correctness of iteration between queries. All of the timings were taken after several iterations over the table in question to minimize caching advantages of later tests.

Basic Iteration

The basic approach is just iterating over the table. The main issue with this approach is that the amount of memory used is not constant; it grows with the size of the table, and I've seen this run out of memory on larger tables.

x = sum(i.id for i in MyModel.objects.all())

Wall time: 3.53 s, 22MB of memory (BAD)

Django Iterator

The Django iterator (at least as of Django 3.2) fixes the memory issue with minor performance benefit. Presumably this comes from Django spending less time managing cache.

assert sum(i.id for i in MyModel.objects.all().iterator(chunk_size=1000)) == x

Wall time: 3.11 s, <1MB of memory

Custom Iterator

The natural comparison point is attempting to do the paging ourselves with progressively increasing queries on the primary key. While this is an improvement over naive iteration in that it has constant memory, it actually loses to Django's built-in iterator on speed because it makes more database queries.

def queryset_iterator(queryset, page_size=1000):
    page = queryset.order_by("pk")[:page_size]
    while page:
        for obj in page:
            yield obj
            pk = obj.pk
        page = queryset.filter(pk__gt=pk).order_by("pk")[:page_size]

assert sum(i.id for i in queryset_iterator(MyModel.objects.all())) == x

Wall time: 3.65 s, <1MB of memory

Custom Paging Function

The main reason to use the custom iteration is so that you can get the results in pages. This function is very useful for plugging into bulk updates while using only constant memory. It's a bit slower than queryset_iterator in my tests and I don't have a coherent theory as to why, but the slowdown isn't substantial.

def queryset_to_pages(queryset, page_size=1000):
    page = queryset.order_by("pk")[:page_size]
    while page:
        yield page
        pk = max(obj.pk for obj in page)
        page = queryset.filter(pk__gt=pk).order_by("pk")[:page_size]

assert sum(i.id for page in queryset_to_pages(MyModel.objects.all()) for i in page) == x

Wall time: 4.49 s, <1MB of memory

Alternative Custom Paging Function

Given that Django's queryset iterator is faster than doing the paging ourselves, the queryset pager can alternatively be implemented on top of it. It's a little bit faster than doing the paging ourselves, but the implementation is messier. Readability matters, which is why my personal preference is the previous paging function, but this one can be better if your queryset doesn't include a primary key in the results (for whatever reason).

def queryset_to_pages2(queryset, page_size=1000):
    page = []
    page_count = 0
    for obj in queryset.iterator():
        page.append(obj)
        page_count += 1
        if page_count == page_size:
            yield page
            page = []
            page_count = 0
    if page:  # don't yield an empty trailing page
        yield page

assert sum(i.id for page in queryset_to_pages2(MyModel.objects.all()) for i in page) == x

Wall time: 4.33 s, <1MB of memory


Bad Approaches

The following are approaches you should never use (many of which are suggested in the question) along with why.

Do NOT Use Slicing on an Unordered Queryset

Whatever you do, do NOT slice an unordered queryset. This does not correctly iterate over the table. The reason is that the slice operation performs a SQL limit + offset query based on your queryset, and Django querysets have no order guarantee unless you use order_by. Additionally, PostgreSQL does not have a default order by, and the Postgres docs specifically warn against using limit + offset without order by. As a result, each time you take a slice you get a non-deterministic slice of your table, which means your slices may overlap or skip rows, and together they won't reliably cover all rows of the table. In my experience, this only happens when something else is modifying data in the table while you are doing the iteration, which only makes the problem more pernicious, because it means the bug might not show up if you test your code in isolation.

def very_bad_iterator(queryset, page_size=1000):
    counter = 0
    count = queryset.count()
    while counter < count:     
        for model in queryset[counter:counter+page_size].iterator():
            yield model
        counter += page_size

assert sum(i.id for i in very_bad_iterator(MyModel.objects.all())) == x

Assertion Error; i.e. INCORRECT RESULT COMPUTED!!!

Do NOT use Slicing for Whole-Table Iteration in General

Even if we order the queryset, list slicing is abysmal from a performance perspective. This is because SQL offset is a linear time operation, which means that a limit + offset paged iteration of a table will be quadratic time, which you absolutely do not want.

def bad_iterator(queryset, page_size=1000):
    counter = 0
    count = queryset.count()
    while counter < count:     
        for model in queryset.order_by("id")[counter:counter+page_size].iterator():
            yield model
        counter += page_size

assert sum(i.id for i in bad_iterator(MyModel.objects.all())) == x

Wall time: 15s (BAD), <1MB of memory

Do NOT use Django's Paginator for Whole-Table Iteration

Django comes with a built-in Paginator. It may be tempting to think that it is appropriate for doing a paged iteration of a database, but it is not. The point of Paginator is to return a single page of results to a UI or an API endpoint. It is substantially slower than any of the good approaches at iterating over a table.

from django.core.paginator import Paginator

def bad_paged_iterator(queryset, page_size=1000):
    p = Paginator(queryset.order_by("pk"), page_size)
    for i in p.page_range:
        yield p.get_page(i)
        
assert sum(i.id for page in bad_paged_iterator(MyModel.objects.all()) for i in page) == x

Wall time: 13.1 s (BAD), <1MB of memory

压抑⊿情绪 2024-10-22 02:33:02

There is a django snippet for this:

http://djangosnippets.org/snippets/1949/

It iterates over a queryset by yielding rows from smaller "chunks" of the original queryset. It ends up using significantly less memory while allowing you to tune for speed. I use it in one of my projects.
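
For reference, a rough sketch of the pk-chunking idea behind that snippet (not the snippet verbatim; the function name is made up):

import gc

def chunked_queryset_iterator(queryset, chunksize=1000):
    """Walk a queryset in primary-key order, one chunk at a time."""
    pk = None
    queryset = queryset.order_by('pk')
    while True:
        chunk = queryset if pk is None else queryset.filter(pk__gt=pk)
        rows = list(chunk[:chunksize])
        if not rows:
            return
        for row in rows:
            pk = row.pk
            yield row
        gc.collect()  # optionally force a garbage-collection pass between chunks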

债姬 2024-10-22 02:33:02

I'm continuing my research, and it looks like I want to do the equivalent of an SQL OFFSET and LIMIT, which according to the Django docs on Limiting QuerySets means I want to use the slice syntax, e.g., SomeModel.objects.all()[15:25]

So now I'm thinking maybe something like this is what I'm looking for:

# Figure out the number of objects I can safely hold in memory
# I'll just say 100 for right now
number_of_objects = 100
count = SomeModel.objects.all().count()
for i in range(0, count, number_of_objects):
    smaller_queryset = SomeModel.objects.all()[i:i + number_of_objects]
    for model_instance in smaller_queryset:
        do_something(model_instance)

By my reckoning this would make it so that smaller_queryset would never grow too large.

甜是你 2024-10-22 02:33:02

以下方法不使用昂贵的数据库偏移量查询,并且避免计算页码,从而更加高效。
文档字符串中指定的限制。

def queryset_pk_iterator(queryset, batch_size=1000):
    """
    Iterator that splits the queryset into batches to reduce memory consumption.
    Useful in cases where builtin .iterator() method of the queryset skips the "prefetch_related" optimization.

    :param queryset: Queryset to iterate over. The supplied queryset must not specify order and limit/offset.
        Queryset objects must have a monotonically increasing and ordering primary key.
    :param batch_size: Size of the batches into which to split the queryset.
    :return: iterator object
    """
    pk = None
    while True:
        batch_queryset = queryset.order_by('pk')
        if pk is not None:
            batch_queryset = batch_queryset.filter(pk__gt=pk)
        batch_queryset = batch_queryset[:batch_size]
        obj = None
        for obj in batch_queryset:
            yield obj
        if obj is None:
            return
        pk = obj.pk

The following approach doesn't use an expensive database OFFSET query and avoids calculating page numbers, which makes it more efficient. Its limitations are specified in the docstring.

def queryset_pk_iterator(queryset, batch_size=1000):
    """
    Iterator that splits the queryset into batches to reduce memory consumption.
    Useful in cases where builtin .iterator() method of the queryset skips the "prefetch_related" optimization.

    :param queryset: Queryset to iterate over. The supplied queryset must not specify order and limit/offset.
        Queryset objects must have a monotonically increasing and ordering primary key.
    :param batch_size: Size of the batches into which to split the queryset.
    :return: iterator object
    """
    pk = None
    while True:
        batch_queryset = queryset.order_by('pk')
        if pk is not None:
            batch_queryset = batch_queryset.filter(pk__gt=pk)
        batch_queryset = batch_queryset[:batch_size]
        obj = None
        for obj in batch_queryset:
            yield obj
        if obj is None:
            return
        pk = obj.pk
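
A usage sketch with the SomeModel and do_something placeholders from the question:

for instance in queryset_pk_iterator(SomeModel.objects.all()):
    do_something(instance)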