Efficient data migration on a large Django table
I need to add a new column to a large (5M row) Django table. I have a South schemamigration that creates the new column. Now I'm writing a datamigration script to populate the new column. It looks like this. (If you're not familiar with South migrations, just ignore the orm. prefixing the model name.)

print "Migrating %s articles." % orm.Article.objects.count()
cnt = 0
for article in orm.Article.objects.iterator():
    if cnt % 500 == 0:
        print " %s done so far" % cnt
    # article.newfield = calculate_newfield(article)
    article.save()
    cnt += 1
I switched from objects.all to objects.iterator to reduce memory requirements. But something is still chewing up vast memory when I run this script. Even with the actually useful line commented out as above, the script still grows to using 10+ GB of RAM before getting very far through the table, and I give up on it.
Seems like something is holding on to these objects in memory. How can I run this so it's not a memory hog?
FWIW, I'm using Python 2.6, Django 1.2.1, South 0.7.2, MySQL 5.1.
5 Answers
If you don't need full access to the objects, you can always use a combo of only and values or values_list on your queryset. That should help reduce the memory requirements significantly, but I'm not sure whether it will be enough.
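For instance, a minimal sketch of that idea inside the data migration, assuming the new value can be computed from a couple of existing fields (id, title and body here are hypothetical column names; substitute your real ones):

# Pull only the primary key and the fields needed for the calculation,
# then write each result back with a targeted UPDATE instead of a full save().
for pk, title, body in orm.Article.objects.values_list('id', 'title', 'body').iterator():
    newval = calculate_newfield_from(title, body)  # hypothetical helper
    orm.Article.objects.filter(pk=pk).update(newfield=newval)
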
Ensure settings.DEBUG is set to False. DEBUG=True fills memory, especially with database-intensive operations, since it stores all the queries sent to the RDBMS within a view.

With Django 1.8 out, this should no longer be necessary, since a hardcoded maximum of 9000 queries is now stored, instead of an infinite number before.
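If you do need to run with DEBUG=True for some reason, a common workaround is to clear the accumulated query log periodically inside the loop; a rough sketch:

from django.db import reset_queries

cnt = 0
for article in orm.Article.objects.iterator():
    # ... per-row work here ...
    cnt += 1
    if cnt % 500 == 0:
        # Under DEBUG=True, old Django versions keep every executed query
        # in memory; dropping the log keeps memory roughly flat.
        reset_queries()
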
Welcome to Django's ORM. I think this is an inherent problem.
I've also had problems with large databases, dumpdata, loaddata and the like.
You have two choices.
Stop trying to use south and write your own ORM migration. You can have multiple database definitions in your settings. Create "old" and "new". Write your own one-time migrator from the old database to the new database. Once that's tested and works, run it one final time and then switch the database definitions and restart Django.
Ditch south and the ORM and write your own SQL migration. Use raw SQL to copy data out of the old structure to the new structure. Debug separately. When it's good, run it one final time and then switch your setting and restart Django.
It's not that south or the ORM are particularly bad. But, for bulk processing in large databases, they cache too much in memory.
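As an illustration of the second option, a raw-SQL backfill could process the table in primary-key batches so nothing large is ever held in memory. This is only a sketch: the table name app_article, the id-range walking, and the UPPER(title) expression are placeholders for whatever actually computes newfield.

from django.db import connection, transaction

cursor = connection.cursor()
cursor.execute("SELECT MIN(id), MAX(id) FROM app_article")
low, high = cursor.fetchone()

batch = 10000
for start in xrange(low, high + 1, batch):
    # Update one contiguous id range at a time, so a single statement
    # never touches more than `batch` rows.
    cursor.execute(
        "UPDATE app_article SET newfield = UPPER(title) "
        "WHERE id BETWEEN %s AND %s",
        [start, start + batch - 1])
    transaction.commit_unless_managed()  # Django 1.2-era explicit commit
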
Or, what happens if you create a raw query in situ which implements a rudimentary resultset size limit?
a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups
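Roughly, that could look like walking the table in fixed-size chunks with Manager.raw() (added in Django 1.2), assuming the frozen South model's default manager supports it; app_article is a placeholder table name and calculate_newfield is the function from the question:

chunk = 500
offset = 0
while True:
    # Only `chunk` model instances are materialized per iteration.
    rows = list(orm.Article.objects.raw(
        "SELECT * FROM app_article ORDER BY id LIMIT %s OFFSET %s",
        [chunk, offset]))
    if not rows:
        break
    for article in rows:
        article.newfield = calculate_newfield(article)
        article.save()
    offset += chunk

Note that LIMIT/OFFSET gets progressively slower on a 5M-row table as the offset grows; filtering on an indexed id range (WHERE id > last_seen ... LIMIT 500) scales better.
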
Does orm.Article.objects.iterator() run the whole query and save the result in memory? Or fetch rows from the database one at a time?
I'm guessing it does it all at once. See if you can replace that loop with a database cursor that pulls the data in an incremental fashion:
eg: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany
It will be slow. I pulled this from a standalone migration script of mine, so ymmv.

fetchmany is standard (PEP 249). I've not done exactly what you're looking for, so there's a little work still to do from this sample: I've not looped over the loop, to get sets of 500 till done, so you'll need to work that out for yourself.
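A rough sketch of that fetchmany approach, assuming direct access to Django's database connection; the table and column names are placeholders, and the outer loop over batches is the part left as an exercise above:

from django.db import connection

read_cursor = connection.cursor()
write_cursor = connection.cursor()
read_cursor.execute("SELECT id, title FROM app_article")

while True:
    rows = read_cursor.fetchmany(500)  # hand back at most 500 rows at a time
    if not rows:
        break
    for pk, title in rows:
        newval = calculate_newfield_from_title(title)  # hypothetical helper
        write_cursor.execute(
            "UPDATE app_article SET newfield = %s WHERE id = %s",
            [newval, pk])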