使用 Django 根据记录中的字段高效更新大量记录
我有大约一百万条 Comment
记录,我想根据该评论的 body
字段更新这些记录。我正在尝试找出如何有效地做到这一点。现在,我的方法如下所示:
update_list = []
qs = Comments.objects.filter(word_count=0)
for comment in qs:
model_obj = Comments.objects.get(id=comment.id)
model_obj.word_count = len(model_obj.body.split())
update_list.append(model_obj)
Comment.objects.bulk_update(update_list, ['word_count'])
但是,这在我的迁移过程中挂起并且似乎超时。有人对我如何实现这一目标有建议吗?
I have about a million Comment
records that I want to update based on that comment's body
field. I'm trying to figure out how to do this efficiently. Right now, my approach looks like this:
update_list = []
qs = Comments.objects.filter(word_count=0)
for comment in qs:
model_obj = Comments.objects.get(id=comment.id)
model_obj.word_count = len(model_obj.body.split())
update_list.append(model_obj)
Comment.objects.bulk_update(update_list, ['word_count'])
However, this hangs and seems to time out in my migration process. Does anybody have suggestions on how I can accomplish this?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
确定 Django 对象的内存占用量并不容易,但绝对最小值是存储其所有数据所需的空间量。我的猜测是,您可能会耗尽内存并出现页面抖动。
您可能希望一次批量处理(例如 1000 个对象)。使用 查询集切片,它会返回另一个查询集。尝试类似的方法,
当循环中的每个行程替换batch_qs和update_list时,都会释放前一个行程占用的空间。打印声明将允许您以希望可接受的正常速度观看它的进展!
警告 - 我从未尝试过这个。我还想知道切片和过滤是否可以很好地配合使用,或者是否应该使用它们
,以便您可以对整个数据库表中的行进行切片并检索需要更新的每个切片的子集。这感觉“更安全”。有人确切知道吗?
It's not easy to determine the memory footprint of a Django object, but an absolute minimum is the amount of space needed to store all of its data. My guess is that you may be running out of memory and page-thrashing.
You probably want to work in batches of, say, 1000 objects at a time. Use Queryset slicing, which returns another queryset. Try something like
Each trip around the loop will free the space occupied by the previous trip when it replaces batch_qs and update_list. The print statement will allow you to watch it progress at a hopefully acceptable, regular rate!
Warning - I have never tried this. I'm also wondering whether slicing and filtering will play nice with each other or whether one should use
so you are slicing your way though rows in the entire DB table and retrieving a subset of each slice that needs updating. This feels "safer". Anybody know for sure?