How to improve the memory efficiency of a Python script

This snippet pulls all of the documents out of my database and dumps them into a gzip-compressed file. docs_to_dump is a Django QuerySet containing all of the text documents to be dumped.

os.chdir(dump_dir)
filename = 'latest-' + court_id + '.xml.gz.part'
# myGzipFile: gzip wrapper used as a context manager (defined in the full script linked below)
with myGzipFile(filename, mode='wb') as z_file:
    # Write the XML declaration and the opening <opinions> root tag
    z_file.write('<?xml version="1.0" encoding="utf-8"?>\n<opinions dumpdate="' + str(date.today()) + '">\n')

    # Build one <opinion> element per document and append it to the dump
    for doc in docs_to_dump:
        row = etree.Element("opinion",
            dateFiled           = str(doc.dateFiled),
            precedentialStatus  = doc.documentType,
            local_path          = str(doc.local_path),
            time_retrieved      = str(doc.time_retrieved),
            download_URL        = doc.download_URL,
            caseNumber          = doc.citation.caseNumber,
            caseNameShort       = doc.citation.caseNameShort,
            court               = doc.court.get_courtUUID_display(),
            sha1                = doc.documentSHA1,
            source              = doc.get_source_display(),
            id                  = str(doc.documentUUID),
        )
        # Prefer the HTML version; otherwise fall back to plain text run through null_map
        if doc.documentHTML != '':
            row.text = doc.documentHTML
        else:
            row.text = doc.documentPlainText.translate(null_map)
        z_file.write('  ' + etree.tostring(row).encode('utf-8') + '\n')

    # Close things off with the closing root tag
    z_file.write('</opinions>')

Unfortunately, it also consumes so much memory that the OS nukes it. I thought that by writing to a "File-like object", the compressed file would get made on the fly, and that memory would remain relatively low. Instead, it's taking up hundreds of MB, then crashing.

I'm not an expert on compression, but my impression is that the whole compressed file is getting stored in memory.
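
Then again, my (possibly wrong) understanding of gzip.GzipFile is that it flushes compressed blocks to disk as data is written, which is what this stripped-down, Django-free sketch is meant to illustrate (Python 2, like the snippet above):

import gzip

# Standalone sketch of what I expected: GzipFile compresses incrementally
# and writes the compressed blocks straight to disk, so the compressed
# output never has to sit in memory as a whole.
z_file = gzip.GzipFile('streaming-test.xml.gz', mode='wb')
z_file.write('<?xml version="1.0" encoding="utf-8"?>\n<opinions>\n')
for i in xrange(1000000):
    z_file.write('  <opinion id="%d">some text</opinion>\n' % i)
z_file.write('</opinions>')
z_file.close()

If that understanding is right, the compression itself shouldn't be what's eating the memory.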

Is there a better way I ought to be doing this?

EDIT -- The whole file is here: https://bitbucket.org/mlissner/search-and-awareness-platform-courtlistener/src/2ca68efd8017/data-dumps/data-dumper.py

Comments (1)

鱼忆七猫命九 2024-11-16 01:57:13

I think andrewski might be right. If it is crashing, try adjusting your queries to use the iterator() method.

Something like:

docs_to_dump = Document.objects.all().order_by('court').iterator()

That should keep your entire QuerySet from being loaded into memory.
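
In context, the only change to the snippet in the question is how docs_to_dump gets built; the gzip-writing side can stay as it is. A rough, untested sketch, reusing the names from the question (myGzipFile, court_id, date, etree and the row-building code all come from the original data-dumper.py):

# iterator() streams rows from the database cursor instead of caching
# every Document in the queryset cache, so memory stays roughly constant
# no matter how many documents get dumped.
docs_to_dump = Document.objects.all().order_by('court').iterator()

filename = 'latest-' + court_id + '.xml.gz.part'
with myGzipFile(filename, mode='wb') as z_file:
    z_file.write('<?xml version="1.0" encoding="utf-8"?>\n<opinions dumpdate="' + str(date.today()) + '">\n')
    for doc in docs_to_dump:
        # ... build the <opinion> element (row) exactly as in the question ...
        z_file.write('  ' + etree.tostring(row).encode('utf-8') + '\n')
    z_file.write('</opinions>')

Most likely it is the queryset cache, not the compression, that grows with the number of documents.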
