Python generator pipeline uses all available memory

I'm working on a CLI application that searches through a disc image for byte strings that could be JPEGs. The core of the application is a pipeline of generators that opens the disc image file, reads blocks of data into a buffer, searches each block for jpg-like byte strings, and saves the matches to the file system as .jpg files.

Building the pipeline exclusively out of generators, I expected memory usage to be slightly higher than the size of the buffer used for reading the input file.

What actually happens is that it begins to run, devours all available RAM and a sizable chunk of swap space, and the process is eventually killed. I've been reading and poking around trying to find the cause, but with no luck after quite a while, which tells me it's probably something obvious that I'm not noticing.
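
For what it's worth, the profiling numbers further down come from tracemalloc and pympler, but the growth could also be confirmed independently of those tools. A crude, hypothetical probe (not part of the actual code) would be to poll the process's peak RSS with the standard resource module from inside the main loop:

import resource

def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (on Linux, ru_maxrss is reported in KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024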

Here's some of the code stripped down and concatenated but showing the same problem:

import os
import re
import sys
from typing import Iterator

JPG_HEADER_PREFIX = b"\xff\xd8"
JPG_EOF = b"\xff\xd9"

FTYPES = ['jpg']
FILE_BOUNDS = {"jpg": (JPG_HEADER_PREFIX, JPG_EOF)}

DEFAULT_FILE_NAME = "img4G.iso"
DEFAULT_FILE_TYPE = "jpg"
DEFAULT_BATCH_SIZE = 2**28 # 256MB
dest_dir="."

def buffer(filename: str, batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[bytes]:
    """
        opens file "filename", reads bytes into buffer and yields bytes
    """
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(batch_size)
            if not chunk:
                break
            yield chunk

def lazy_match(chunk: bytes, file_type: str = DEFAULT_FILE_TYPE) -> Iterator[bytes]:
    """
        Takes buffer-full of bytes, yields byte strings
        of the form "SOI....+EOI" with no intervening SOIs or EOIs
    """
    header, eof = FILE_BOUNDS[file_type]
    file_pattern = b'%s(?:(?!%s)[\x00-\xff]){1000,}?%s' % (header, header, eof)
    matches = re.finditer(file_pattern, chunk)
    for m in matches:
        print("Size of m is: ", sys.getsizeof(m.group()))
        yield m.group()

def lazy_find_files(file: str = DEFAULT_FILE_NAME) -> Iterator[bytes]:
    for chunk in buffer(file):
        yield from lazy_match(chunk)


if __name__ == "__main__":
    from hashlib import md5
    import tracemalloc
    from pympler import muppy, summary

    tracemalloc.start(25)

    try:
        for f in lazy_find_files():
            dest_file = md5(f).hexdigest() + "." + DEFAULT_FILE_TYPE
            with open(os.path.join(dest_dir, dest_file), 'wb') as dest:
                dest.write(f)

    finally:
        mups = muppy.get_objects()
        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.statistics('traceback')
        stat = top_stats[0]
        print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
        sumy = summary.summarize(mups)
        summary.print_(sumy)
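
To try this without a real 4 GB disc image, a test input can be generated. The helper below is hypothetical and not part of the application; it just writes JPEG-marker-delimited blobs separated by random padding, reusing the constants defined above (the blob count and size would presumably need to be scaled up considerably before the memory growth shows):

def make_test_image(path: str = "img_test.iso",
                    blobs: int = 5,
                    blob_size: int = 50_000) -> None:
    """Write a file with JPEG-marker-delimited byte strings separated by random padding."""
    with open(path, 'wb') as out:
        for _ in range(blobs):
            out.write(os.urandom(blob_size))   # padding; may contain marker bytes by chance
            out.write(JPG_HEADER_PREFIX)       # SOI marker, defined above
            out.write(os.urandom(blob_size))   # fake payload, well over the 1000-byte minimum
            out.write(JPG_EOF)                 # EOI marker, defined above
        out.write(os.urandom(blob_size))       # trailing padding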

Here's example pympler/tracemalloc output from a typical run (keyboard interrupted):

4 memory blocks: 8341738.8 KiB
                       types |   # objects |   total size
============================ | =========== | ============
                       bytes |         108 |    256.01 MB
                         str |       15416 |      3.04 MB
                        dict |        4656 |      1.78 MB
                        code |        5592 |    966.24 KB
                        type |         934 |    754.97 KB
                       tuple |        4733 |    272.88 KB
          wrapper_descriptor |        2231 |    156.87 KB
  builtin_function_or_method |        1502 |    105.61 KB
                         set |         131 |     93.63 KB
                        list |         465 |     92.73 KB
           method_descriptor |        1267 |     89.09 KB
                     weakref |        1260 |     88.59 KB
                 abc.ABCMeta |          87 |     85.66 KB
                   frozenset |         131 |     57.88 KB
           getset_descriptor |         897 |     56.06 KB

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/.../pypenador/__main__.py", line 43, in <module>
    for f in lazy_find_files(input_file):
  File "/.../pypenador/scrounge.py", line 77, in lazy_find_files
    yield from lazy_match(chunk, file_type=DEFAULT_FILE_TYPE)
  File "/.../pypenador/scrounge.py", line 71, in lazy_match
    for m in matches:
KeyboardInterrupt

When printing tracemalloc's statistics by line number:

/home/.../example.py:36: size=10183 MiB, count=11, average=926 MiB
/home/.../example.py:22: size=256 MiB, count=1, average=256 MiB

where example.py:36 corresponds to the for-loop header in lazy_match:

for m in matches:

where matches = re.finditer(file_pattern, chunk), suggesting that the problem is related to consuming the finditer iterator.
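
The per-line numbers above come from grouping the same kind of snapshot by line instead of by traceback; the exact call isn't in the stripped-down code, but it amounts to something like:

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:2]:
    print(stat)   # e.g. ".../example.py:36: size=10183 MiB, count=11, average=926 MiB"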

Thanks in advance.
