Efficiently updating a large number of records in Django based on a field on each record


I have about a million Comment records that I want to update based on that comment's body field. I'm trying to figure out how to do this efficiently. Right now, my approach looks like this:

update_list = []
qs = Comment.objects.filter(word_count=0)
for comment in qs:
    model_obj = Comment.objects.get(id=comment.id)
    model_obj.word_count = len(model_obj.body.split())
    update_list.append(model_obj)
Comment.objects.bulk_update(update_list, ['word_count'])

However, this hangs and seems to time out in my migration process. Does anybody have suggestions on how I can accomplish this?


1 Answer

请恋爱 2025-01-17 05:45:34


It's not easy to determine the memory footprint of a Django object, but an absolute minimum is the amount of space needed to store all of its data. My guess is that you may be running out of memory and page-thrashing: iterating qs caches all million model instances on the queryset, update_list accumulates references to every one of them, and the per-row Comment.objects.get(id=comment.id) re-fetches data you already have, adding a million extra queries.
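
One way to keep the footprint bounded without reshaping the code much is to stream the rows with iterator(), which fetches from the database in chunks and skips the queryset cache, and to flush bulk_update in fixed-size batches. A minimal, untested sketch, assuming Django 2.2+ (for bulk_update) and the Comment model from the question:

# Stream rows instead of caching the whole queryset in memory.
FLUSH_EVERY = 1000

update_list = []
for comment in Comment.objects.filter(word_count=0).iterator(chunk_size=2000):
    comment.word_count = len(comment.body.split())
    update_list.append(comment)
    if len(update_list) >= FLUSH_EVERY:
        Comment.objects.bulk_update(update_list, ['word_count'])
        update_list = []  # drop references so the finished batch can be freed

if update_list:  # flush the final partial batch
    Comment.objects.bulk_update(update_list, ['word_count'])

At no point does this hold more than a couple of thousand model instances at once.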

You probably want to work in batches of, say, 1000 objects at a time. Use QuerySet slicing, which returns another queryset. Try something like:

BATCH_SIZE = 1000
start = 0
base_qs = Comment.objects.filter(word_count=0)

while True:
    # NB: rows updated below drop out of base_qs's filter; see the
    # caveat about slicing and filtering at the end of this answer.
    batch_qs = base_qs[start:start + BATCH_SIZE]
    if not batch_qs.exists():
        break

    update_list = []
    for comment in batch_qs:
        # comment is already fully loaded; no need to re-fetch it with .get()
        comment.word_count = len(comment.body.split())
        update_list.append(comment)
    Comment.objects.bulk_update(update_list, ['word_count'])
    print(f'Processed batch starting at {start}')
    start += BATCH_SIZE

Each trip around the loop will free the space occupied by the previous trip when it replaces batch_qs and update_list. The print statement will allow you to watch it progress at a hopefully acceptable, regular rate!
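
If memory turns out to be fine but the UPDATE statements themselves are slow, note that bulk_update also accepts a batch_size argument (Django 2.2+) that caps how many rows each generated UPDATE touches, independently of BATCH_SIZE:

# Cap each generated UPDATE statement at 500 rows.
Comment.objects.bulk_update(update_list, ['word_count'], batch_size=500)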

Warning - I have never tried this. I'm also wondering whether slicing and filtering will play nice with each other or whether one should use

base_qs = Comment.objects.all()
...
while True:
    batch_qs = base_qs[start:start + BATCH_SIZE]
    ...
    for comment in batch_qs.filter(word_count=0):

so you are slicing your way though rows in the entire DB table and retrieving a subset of each slice that needs updating. This feels "safer". Anybody know for sure?
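
For what it's worth: Django refuses to filter a queryset once it has been sliced ("Cannot filter a query once a slice has been taken"), so the batch_qs.filter(word_count=0) variant fails outright. The first version has a subtler problem too: every row you update drops out of the word_count=0 filter, so advancing start skips rows that have shifted down into earlier offsets. Keyset pagination on the primary key avoids both issues; a minimal, untested sketch, assuming Comment has the default integer pk:

# Page by primary key instead of by offset, so rows dropping out of
# the filter as they are updated can never shift the window.
BATCH_SIZE = 1000
last_pk = 0

while True:
    batch = list(
        Comment.objects.filter(word_count=0, pk__gt=last_pk)
        .order_by('pk')[:BATCH_SIZE]
    )
    if not batch:
        break

    for comment in batch:
        comment.word_count = len(comment.body.split())
    Comment.objects.bulk_update(batch, ['word_count'])
    last_pk = batch[-1].pk
    print(f'Processed {len(batch)} rows, up to pk {last_pk}')

If exact parity with Python's str.split() is not required, the whole job can also be pushed into a single UPDATE using your database's string functions, but that is backend-specific.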
