How to improve the memory efficiency of a Python script


This snippet pulls all of the documents out of my database and dumps them into a gzip-compressed file. docs_to_dump is a Django queryset containing all of the text documents to be dumped.

import os
from datetime import date
from lxml import etree

# myGzipFile, dump_dir, court_id, docs_to_dump and null_map are defined
# elsewhere in the full script (see the EDIT link below).
os.chdir(dump_dir)
filename = 'latest-' + court_id + '.xml.gz.part'
with myGzipFile(filename, mode='wb') as z_file:
    z_file.write('<?xml version="1.0" encoding="utf-8"?>\n<opinions dumpdate="' + str(date.today()) + '">\n')

    for doc in docs_to_dump:
        row = etree.Element("opinion",
            dateFiled           = str(doc.dateFiled),
            precedentialStatus  = doc.documentType,
            local_path          = str(doc.local_path),
            time_retrieved      = str(doc.time_retrieved),
            download_URL        = doc.download_URL,
            caseNumber          = doc.citation.caseNumber,
            caseNameShort       = doc.citation.caseNameShort,
            court               = doc.court.get_courtUUID_display(),
            sha1                = doc.documentSHA1,
            source              = doc.get_source_display(),
            id                  = str(doc.documentUUID),
        )
        if doc.documentHTML != '':
            row.text = doc.documentHTML
        else:
            row.text = doc.documentPlainText.translate(null_map)
        z_file.write('  ' + etree.tostring(row).encode('utf-8') + '\n')

    # Close things off
    z_file.write('</opinions>')

Unfortunately, it also consumes so much memory that the OS kills it. I thought that by writing to a file-like object the compressed file would be produced on the fly and memory usage would stay relatively low. Instead, it grows to hundreds of MB and then crashes.

I'm no expert on compression, but my impression is that the whole compressed file is being held in memory.

Is there a better way I should be doing this?

EDIT -- The whole file is here: https://bitbucket.org/mlissner/search-and-awareness-platform-courtlistener/src/2ca68efd8017/data-dumps/data-dumper.py
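For reference, here is a minimal standalone sketch of the streaming behaviour the snippet above is relying on: a GzipFile opened with gzip.open generally compresses and writes each chunk as it is written, so this part of the pipeline does not need to hold the whole output in memory. The filename, the dummy rows, and the loop count below are illustrative only.

import gzip
from datetime import date

# Illustrative only: stream an XML wrapper plus rows straight into a gzip file,
# writing each row as it is produced instead of building the document in memory.
with gzip.open('example.xml.gz', 'wb') as z_file:
    header = '<?xml version="1.0" encoding="utf-8"?>\n<opinions dumpdate="%s">\n' % date.today()
    z_file.write(header.encode('utf-8'))
    for i in range(100000):                       # stand-in for the real document loop
        row = '  <opinion id="%d"/>\n' % i        # stand-in for etree.tostring(row)
        z_file.write(row.encode('utf-8'))         # compressed and flushed incrementally
    z_file.write(b'</opinions>\n')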



Comments (1)

鱼忆七猫命九 2024-11-16 01:57:13

I think andrewski might be right. If it is crashing, try adjusting your query to use the iterator() method.

Something like:

docs_to_dump = Document.objects.all().order_by('court').iterator()

That should keep the entire queryset from being loaded into memory.
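For completeness, a sketch of how the question's loop might look once the queryset comes from .iterator(). build_opinion_row is a hypothetical helper standing in for the etree.Element(...) block in the question; Document, myGzipFile, filename, date, and etree are assumed to be defined or imported as in the original script.

# Sketch only: same dump loop as in the question, but fed by .iterator() so rows
# are fetched, serialized, and written one at a time.
docs_to_dump = Document.objects.all().order_by('court').iterator()

with myGzipFile(filename, mode='wb') as z_file:
    z_file.write('<?xml version="1.0" encoding="utf-8"?>\n<opinions dumpdate="' + str(date.today()) + '">\n')
    for doc in docs_to_dump:
        # Each document is fetched, serialized, and written, then becomes garbage,
        # instead of the whole queryset being cached up front.
        row = build_opinion_row(doc)   # hypothetical helper wrapping the etree.Element(...) call above
        z_file.write('  ' + etree.tostring(row) + '\n')
    z_file.write('</opinions>')

With iterator(), Django skips populating its queryset result cache, so memory usage should stay roughly proportional to one row at a time rather than the whole result set.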

