为什么迭代大型 Django QuerySet 会消耗大量内存?

发布于 2024-10-03 03:17:04 字数 389 浏览 6 评论 0原文

该表包含大约一千万行。

for event in Event.objects.all():
    print event

这会导致内存使用量稳定增加到 4 GB 左右,此时行会快速打印。第一行打印之前的漫长延迟让我感到惊讶——我预计它几乎会立即打印。

我还尝试了 Event.objects.iterator() ,其行为方式相同。

我不明白 Django 正在将什么加载到内存中,也不明白它为什么要这样做。我期望 Django 在数据库级别迭代结果,这意味着结果将以大致恒定的速率打印(而不是在漫长的等待后一次全部打印)。

我误解了什么?

(我不知道这是否相关,但我正在使用 PostgreSQL。)

The table in question contains roughly ten million rows.

for event in Event.objects.all():
    print event

This causes memory usage to increase steadily to 4 GB or so, at which point the rows print rapidly. The lengthy delay before the first row printed surprised me – I expected it to print almost instantly.

I also tried Event.objects.iterator() which behaved the same way.

I don't understand what Django is loading into memory or why it is doing this. I expected Django to iterate through the results at the database level, which'd mean the results would be printed at roughly a constant rate (rather than all at once after a lengthy wait).

What have I misunderstood?

(I don't know whether it's relevant, but I'm using PostgreSQL.)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(10

笔芯 2024-10-10 03:17:05

Nate C 很接近,但还不够。

来自文档

您可以通过以下方式评估查询集:

  • 迭代。 QuerySet 是可迭代的,并且它会在您第一次迭代它时执行其数据库查询。例如,这将打印数据库中所有条目的标题:

    对于 Entry.objects.all() 中的 e:
        打印电子标题
    

因此,当您第一次进入该循环并获取查询集的迭代形式时,会立即检索一千万行。您所经历的等待是 Django 加载数据库行并为每一行创建对象,然后返回您可以实际迭代的内容。然后你就将所有内容都存储在内存中,结果就会溢出。

根据我对文档的阅读,iterator() 只是绕过 QuerySet 的内部缓存机制。我认为做一件一件的事情可能是有意义的,但这反过来需要对数据库进行一千万次单独的点击。也许并不是那么理想。

有效地迭代大型数据集是我们还没有完全正确的事情,但是您可能会发现一些片段对您的目的有用:

Nate C was close, but not quite.

From the docs:

You can evaluate a QuerySet in the following ways:

  • Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:

    for e in Entry.objects.all():
        print e.headline
    

So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.

From my reading of the docs, iterator() does nothing more than bypass QuerySet's internal caching mechanisms. I think it might make sense for it to a do a one-by-one thing, but that would conversely require ten-million individual hits on your database. Maybe not all that desirable.

Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes:

旧城空念 2024-10-10 03:17:05

可能不是更快或最有效,但作为一个现成的解决方案,为什么不使用此处记录的 django 核心的 Paginator 和 Page 对象:

https://docs.djangoproject.com/en/dev/topics/pagination/

像这样的:

from django.core.paginator import Paginator
from djangoapp.models import model

paginator = Paginator(model.objects.all(), 1000) # chunks of 1000, you can 
                                                 # change this to desired chunk size

for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can do whatever you want with the row
    print "done processing page %s" % page

Might not be the faster or most efficient, but as a ready-made solution why not use django core's Paginator and Page objects documented here:

https://docs.djangoproject.com/en/dev/topics/pagination/

Something like this:

from django.core.paginator import Paginator
from djangoapp.models import model

paginator = Paginator(model.objects.all(), 1000) # chunks of 1000, you can 
                                                 # change this to desired chunk size

for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can do whatever you want with the row
    print "done processing page %s" % page
我不吻晚风 2024-10-10 03:17:05

Django 的默认行为是在评估查询时缓存 QuerySet 的整个结果。您可以使用 QuerySet 的迭代器方法来避免这种缓存:

for event in Event.objects.all().iterator():
    print event

https:// docs.djangoproject.com/en/stable/ref/models/querysets/#iterator

iterator() 方法评估查询集,然后直接读取结果,而不在查询集级别进行缓存。当迭代大量只需要访问一次的对象时,此方法会带来更好的性能并显着减少内存。请注意,缓存仍然是在数据库级别完成的。

使用 iterator() 减少了我的内存使用量,但它仍然高于我的预期。使用 mpaf 建议的分页器方法使用的内存要少得多,但对于我的测试用例来说速度慢 2-3 倍。

from django.core.paginator import Paginator

def chunked_iterator(queryset, chunk_size=10000):
    paginator = Paginator(queryset, chunk_size)
    for page in range(1, paginator.num_pages + 1):
        for obj in paginator.page(page).object_list:
            yield obj

for event in chunked_iterator(Event.objects.all()):
    print event

Django's default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet's iterator method to avoid this caching:

for event in Event.objects.all().iterator():
    print event

https://docs.djangoproject.com/en/stable/ref/models/querysets/#iterator

The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.

Using iterator() reduces memory usage for me, but it is still higher than I expected. Using the paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case.

from django.core.paginator import Paginator

def chunked_iterator(queryset, chunk_size=10000):
    paginator = Paginator(queryset, chunk_size)
    for page in range(1, paginator.num_pages + 1):
        for obj in paginator.page(page).object_list:
            yield obj

for event in chunked_iterator(Event.objects.all()):
    print event
各自安好 2024-10-10 03:17:05

对于大量记录,数据库游标性能更好。你确实需要 Django 中的原始 SQL,Django 游标与 SQL 游标不同。

Nate C 建议的 LIMIT - OFFSET 方法可能足以满足您的情况。对于大量数据,它比游标慢,因为它必须一遍又一遍地运行相同的查询,并且必须跳过越来越多的结果。

For large amounts of records, a database cursor performs even better. You do need raw SQL in Django, the Django-cursor is something different than a SQL cursur.

The LIMIT - OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor because it has to run the same query over and over again and has to jump over more and more results.

无畏 2024-10-10 03:17:05

Django 没有很好的解决方案来从数据库中获取大项目。

import gc
# Get the events in reverse order
eids = Event.objects.order_by("-id").values_list("id", flat=True)

for index, eid in enumerate(eids):
    event = Event.object.get(id=eid)
    # do necessary work with event
    if index % 100 == 0:
       gc.collect()
       print("completed 100 items")

values_list 可以用来获取数据库中所有的id,然后获取每个对象单独。随着时间的推移,大对象将在内存中创建,并且在退出 for 循环之前不会被垃圾收集。上面的代码在每消耗第 100 个项目后进行手动垃圾回收。

Django doesn't have good solution for fetching large items from database.

import gc
# Get the events in reverse order
eids = Event.objects.order_by("-id").values_list("id", flat=True)

for index, eid in enumerate(eids):
    event = Event.object.get(id=eid)
    # do necessary work with event
    if index % 100 == 0:
       gc.collect()
       print("completed 100 items")

values_list can be used to fetch all the ids in the databases and then fetch each object separately. Over a time large objects will be created in memory and won't be garbage collected til for loop is exited. Above code does manual garbage collection after every 100th item is consumed.

灯下孤影 2024-10-10 03:17:05

这里有很多过时的结果。不确定何时添加,但 Django 的 QuerySet.iterator() 方法 使用具有块大小的服务器端游标,从数据库传输结果。因此,如果您使用的是 postgres,现在应该可以为您处理开箱即用的问题。

There are a lot of outdated results here. Not sure when it was added, but Django's QuerySet.iterator() method uses a server-side cursor with a chunk size, to stream results from the database. So if you're using postgres, this should now be handled out of the box for you.

梦情居士 2024-10-10 03:17:05

这是来自文档:
http://docs.djangoproject.com/en/dev/ref/models/查询集/

在您执行某些操作来评估查询集之前,不会真正发生数据库活动。

因此,当运行 print 事件 时,查询会触发(根据您的命令,这是一次全表扫描。)并加载结果。您要求所有对象,并且如果不获得所有对象就无法获得第一个对象。

但如果你这样做:

Event.objects.all()[300:900]

http://docs.djangoproject .com/en/dev/topics/db/queries/#limiting-querysets

然后它会在内部为sql添加偏移量和限制。

This is from the docs:
http://docs.djangoproject.com/en/dev/ref/models/querysets/

No database activity actually occurs until you do something to evaluate the queryset.

So when the print event is run the query fires (which is a full table scan according to your command.) and loads the results. Your asking for all the objects and there is no way to get the first object without getting all of them.

But if you do something like:

Event.objects.all()[300:900]

http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets

Then it will add offsets and limits to the sql internally.

网白 2024-10-10 03:17:05

在迭代查询集之前会消耗大量内存,因为整个查询的所有数据库行都会立即处理成对象,并且根据行数可能会进行大量处理。

您可以将查询集分成更小的易消化的位。我将这种模式称为“勺子喂养”。这是我在管理命令中使用的带有进度条的实现,首先 pip3 install tqdm

from tqdm import tqdm


def spoonfeed(qs, func, chunk=1000, start=0):
    """
    Chunk up a large queryset and run func on each item.

    Works with automatic primary key fields.

    chunk -- how many objects to take on at once
    start -- PK to start from

    >>> spoonfeed(Spam.objects.all(), nom_nom)
    """
    end = qs.order_by('pk').last()
    progressbar = tqdm(total=qs.count())
    if not end:
        return
    while start < end.pk:
        for o in qs.filter(pk__gt=start, pk__lte=start+chunk):
            func(o)
            progressbar.update(1)
        start += chunk
    progressbar.close()

要使用它,您需要编写一个对对象执行操作的函数:

def set_population(town):
    town.population = calculate_population(...)
    town.save()

然后在查询集上运行该函数:

spoonfeed(Town.objects.all(), set_population)

Massive amount of memory gets consumed before the queryset can be iterated because all database rows for a whole query get processed into objects at once and it can be a lot of processing depending on a number of rows.

You can chunk up your queryset into smaller digestible bits. I call the pattern to do this "spoonfeeding". Here's an implementation with a progress-bar I use in my management commands, first pip3 install tqdm

from tqdm import tqdm


def spoonfeed(qs, func, chunk=1000, start=0):
    """
    Chunk up a large queryset and run func on each item.

    Works with automatic primary key fields.

    chunk -- how many objects to take on at once
    start -- PK to start from

    >>> spoonfeed(Spam.objects.all(), nom_nom)
    """
    end = qs.order_by('pk').last()
    progressbar = tqdm(total=qs.count())
    if not end:
        return
    while start < end.pk:
        for o in qs.filter(pk__gt=start, pk__lte=start+chunk):
            func(o)
            progressbar.update(1)
        start += chunk
    progressbar.close()

To use this you write a function that does operations on your object:

def set_population(town):
    town.population = calculate_population(...)
    town.save()

and than run that function on your queryset:

spoonfeed(Town.objects.all(), set_population)
冷︶言冷语的世界 2024-10-10 03:17:05

这是一个包括 len 和 count 的解决方案:

class GeneratorWithLen(object):
    """
    Generator that includes len and count for given queryset
    """
    def __init__(self, generator, length):
        self.generator = generator
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.generator

    def __getitem__(self, item):
        return self.generator.__getitem__(item)

    def next(self):
        return next(self.generator)

    def count(self):
        return self.__len__()

def batch(queryset, batch_size=1024):
    """
    returns a generator that does not cache results on the QuerySet
    Aimed to use with expected HUGE/ENORMOUS data sets, no caching, no memory used more than batch_size

    :param batch_size: Size for the maximum chunk of data in memory
    :return: generator
    """
    total = queryset.count()

    def batch_qs(_qs, _batch_size=batch_size):
        """
        Returns a (start, end, total, queryset) tuple for each batch in the given
        queryset.
        """
        for start in range(0, total, _batch_size):
            end = min(start + _batch_size, total)
            yield (start, end, total, _qs[start:end])

    def generate_items():
        queryset.order_by()  # Clearing... ordering by id if PK autoincremental
        for start, end, total, qs in batch_qs(queryset):
            for item in qs:
                yield item

    return GeneratorWithLen(generate_items(), total)

用法:

events = batch(Event.objects.all())
len(events) == events.count()
for event in events:
    # Do something with the Event

Here a solution including len and count:

class GeneratorWithLen(object):
    """
    Generator that includes len and count for given queryset
    """
    def __init__(self, generator, length):
        self.generator = generator
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.generator

    def __getitem__(self, item):
        return self.generator.__getitem__(item)

    def next(self):
        return next(self.generator)

    def count(self):
        return self.__len__()

def batch(queryset, batch_size=1024):
    """
    returns a generator that does not cache results on the QuerySet
    Aimed to use with expected HUGE/ENORMOUS data sets, no caching, no memory used more than batch_size

    :param batch_size: Size for the maximum chunk of data in memory
    :return: generator
    """
    total = queryset.count()

    def batch_qs(_qs, _batch_size=batch_size):
        """
        Returns a (start, end, total, queryset) tuple for each batch in the given
        queryset.
        """
        for start in range(0, total, _batch_size):
            end = min(start + _batch_size, total)
            yield (start, end, total, _qs[start:end])

    def generate_items():
        queryset.order_by()  # Clearing... ordering by id if PK autoincremental
        for start, end, total, qs in batch_qs(queryset):
            for item in qs:
                yield item

    return GeneratorWithLen(generate_items(), total)

Usage:

events = batch(Event.objects.all())
len(events) == events.count()
for event in events:
    # Do something with the Event
聊慰 2024-10-10 03:17:05

我通常使用原始 MySQL 原始查询而不是 Django ORM 来完成此类任务。

MySQL支持流模式,因此我们可以安全快速地循环所有记录,而不会出现内存不足错误。

import MySQLdb
db_config = {}  # config your db here
connection = MySQLdb.connect(
        host=db_config['HOST'], user=db_config['USER'],
        port=int(db_config['PORT']), passwd=db_config['PASSWORD'], db=db_config['NAME'])
cursor = MySQLdb.cursors.SSCursor(connection)  # SSCursor for streaming mode
cursor.execute("SELECT * FROM event")
while True:
    record = cursor.fetchone()
    if record is None:
        break
    # Do something with record here

cursor.close()
connection.close()

参考:

  1. 从 MySQL 检索百万行
  2. MySQL 结果集流式传输与一次获取整个 JDBC 结果集的性能如何

I usually use raw MySQL raw query instead of Django ORM for this kind of task.

MySQL supports streaming mode so we can loop through all records safely and fast without out of memory error.

import MySQLdb
db_config = {}  # config your db here
connection = MySQLdb.connect(
        host=db_config['HOST'], user=db_config['USER'],
        port=int(db_config['PORT']), passwd=db_config['PASSWORD'], db=db_config['NAME'])
cursor = MySQLdb.cursors.SSCursor(connection)  # SSCursor for streaming mode
cursor.execute("SELECT * FROM event")
while True:
    record = cursor.fetchone()
    if record is None:
        break
    # Do something with record here

cursor.close()
connection.close()

Ref:

  1. Retrieving million of rows from MySQL
  2. How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文