Efficient data migration on a large Django table
I need to add a new column to a large (5M row) Django table. I have a South schemamigration that creates the new column. Now I'm writing a datamigration script to populate the new column. It looks like this. (If you're not familiar with South migrations, just ignore the orm. prefixing the model name.)

print "Migrating %s articles." % orm.Article.objects.count()
cnt = 0
for article in orm.Article.objects.iterator():
    if cnt % 500 == 0:
        print " %s done so far" % cnt
    # article.newfield = calculate_newfield(article)
    article.save()
    cnt += 1
I switched from objects.all to objects.iterator to reduce memory requirements. But something is still chewing up vast memory when I run this script. Even with the actually useful line commented out as above, the script still grows to using 10+ GB of RAM before getting very far through the table, and I give up on it.
Seems like something is holding on to these objects in memory. How can I run this so it's not a memory hog?
FWIW, I'm using Python 2.6, Django 1.2.1, South 0.7.2, MySQL 5.1.
5 Answers
If you don't need full access to the objects, you can always use a combo of only and values or values_list on your queryset. That should help reduce the memory requirements significantly, but I'm not sure whether it will be enough.
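For instance, a minimal sketch of that idea inside the data migration, assuming the new value can be computed from a couple of existing fields (id, title and body here are hypothetical column names; substitute your real ones):

# Pull only the primary key and the fields needed for the calculation,
# then write each result back with a targeted UPDATE instead of a full save().
for pk, title, body in orm.Article.objects.values_list('id', 'title', 'body').iterator():
    newval = calculate_newfield_from(title, body)  # hypothetical helper
    orm.Article.objects.filter(pk=pk).update(newfield=newval)
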
Ensure settings.DEBUG is set to False. DEBUG=True fills memory, especially with database-intensive operations, since it stores all the queries sent to the RDBMS within a view.

With Django 1.8 out, this should no longer be necessary, since a hardcoded maximum of 9000 queries is now stored, instead of an infinite number before.
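If you do need to run with DEBUG=True for some reason, a common workaround is to clear the accumulated query log periodically inside the loop; a rough sketch:

from django.db import reset_queries

cnt = 0
for article in orm.Article.objects.iterator():
    # ... per-row work here ...
    cnt += 1
    if cnt % 500 == 0:
        # Under DEBUG=True, old Django versions keep every executed query
        # in memory; dropping the log keeps memory roughly flat.
        reset_queries()
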
Welcome to Django's ORM. I think this is an inherent problem.
I've also had problems with large databases, dumpdata, loaddata and the like.
You have two choices.
Stop trying to use south and write your own ORM migration. You can have multiple database definitions in your settings. Create "old" and "new". Write your own one-time migrator from the old database to the new database. Once that's tested and works, run it one final time and then switch the database definitions and restart Django.
Ditch south and the ORM and write your own SQL migration. Use raw SQL to copy data out of the old structure to the new structure. Debug separately. When it's good, run it one final time and then switch your setting and restart Django.
It's not that south or the ORM are particularly bad. But, for bulk processing in large databases, they cache too much in memory.
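As an illustration of the second option, a raw-SQL backfill could process the table in primary-key batches so nothing large is ever held in memory. This is only a sketch: the table name app_article, the id-range walking, and the UPPER(title) expression are placeholders for whatever actually computes newfield.

from django.db import connection, transaction

cursor = connection.cursor()
cursor.execute("SELECT MIN(id), MAX(id) FROM app_article")
low, high = cursor.fetchone()

batch = 10000
for start in xrange(low, high + 1, batch):
    # Update one contiguous id range at a time, so a single statement
    # never touches more than `batch` rows.
    cursor.execute(
        "UPDATE app_article SET newfield = UPPER(title) "
        "WHERE id BETWEEN %s AND %s",
        [start, start + batch - 1])
    transaction.commit_unless_managed()  # Django 1.2-era explicit commit
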
Or, what happens if you create a raw query in situ which implements a rudimentary resultset size limit?
a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups
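Roughly, that could look like walking the table in fixed-size chunks with Manager.raw() (added in Django 1.2), assuming the frozen South model's default manager supports it; app_article is a placeholder table name and calculate_newfield is the function from the question:

chunk = 500
offset = 0
while True:
    # Only `chunk` model instances are materialized per iteration.
    rows = list(orm.Article.objects.raw(
        "SELECT * FROM app_article ORDER BY id LIMIT %s OFFSET %s",
        [chunk, offset]))
    if not rows:
        break
    for article in rows:
        article.newfield = calculate_newfield(article)
        article.save()
    offset += chunk

Note that LIMIT/OFFSET gets progressively slower on a 5M-row table as the offset grows; filtering on an indexed id range (WHERE id > last_seen ... LIMIT 500) scales better.
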
Does orm.Article.objects.iterator() run the whole query and save the result in memory? Or fetch rows from the database one at a time?
I'm guessing it does it all at once. See if you can replace that loop with a database cursor that pulls the data in an incremental fashion:
eg: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany
It will be slow. I pulled this from a standalone migration script of mine, so ymmv.

fetchmany is standard (PEP 249). I've not done exactly what you're looking for, so there's a little work still to do from this sample: I've not looped over the loop, to get sets of 500 till done, so you'll need to work that out for yourself.
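A rough sketch of that fetchmany approach, assuming direct access to Django's database connection; the table and column names are placeholders, and the outer loop over batches is the part left as an exercise above:

from django.db import connection

read_cursor = connection.cursor()
write_cursor = connection.cursor()
read_cursor.execute("SELECT id, title FROM app_article")

while True:
    rows = read_cursor.fetchmany(500)  # hand back at most 500 rows at a time
    if not rows:
        break
    for pk, title in rows:
        newval = calculate_newfield_from_title(title)  # hypothetical helper
        write_cursor.execute(
            "UPDATE app_article SET newfield = %s WHERE id = %s",
            [newval, pk])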