Why does this Python loop leak memory?

Posted on 2024-08-20 03:40:22

I am writing a custom file system crawler, which gets passed millions of globs to process through sys.stdin. I'm finding that when running the script, its memory usage increases massively over time and the whole thing practically grinds to a halt. I've written a minimal case below which shows the problem. Am I doing something wrong, or have I found a bug in Python / the glob module? (I am using Python 2.5.2.)


#!/usr/bin/env python
import glob
import sys
import gc

previous_num_objects = 0

for count, line in enumerate(sys.stdin):
    # Each stdin line is one glob pattern to expand.
    glob_result = glob.glob(line.rstrip('\n'))
    # Count every object currently tracked by the garbage collector.
    current_num_objects = len(gc.get_objects())
    new_objects = current_num_objects - previous_num_objects

    print "(%d) This: %d, New: %d, Python Garbage: %d, Python Collection Counts: %s" \
        % (count, current_num_objects, new_objects, len(gc.garbage), gc.get_count())
    previous_num_objects = current_num_objects

The output looks like:

(0) This: 4042, New: 4042, Python Garbage: 0, Python Collection Counts: (660, 5, 0)
(1) This: 4061, New: 19, Python Garbage: 0, Python Collection Counts: (90, 6, 0)
(2) This: 4064, New: 3, Python Garbage: 0, Python Collection Counts: (127, 6, 0)
(3) This: 4067, New: 3, Python Garbage: 0, Python Collection Counts: (130, 6, 0)
(4) This: 4070, New: 3, Python Garbage: 0, Python Collection Counts: (133, 6, 0)
(5) This: 4073, New: 3, Python Garbage: 0, Python Collection Counts: (136, 6, 0)
(6) This: 4076, New: 3, Python Garbage: 0, Python Collection Counts: (139, 6, 0)
(7) This: 4079, New: 3, Python Garbage: 0, Python Collection Counts: (142, 6, 0)
(8) This: 4082, New: 3, Python Garbage: 0, Python Collection Counts: (145, 6, 0)
(9) This: 4085, New: 3, Python Garbage: 0, Python Collection Counts: (148, 6, 0)

Every 100th iteration, 100 objects are freed, so len(gc.get_objects()) increases by 200 every 100 iterations. len(gc.garbage) never changes from 0. The 2nd-generation collection count increases slowly, while the 0th- and 1st-generation counts go up and down.
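
A quick way to see what those new objects actually are is to snapshot the tracked objects before and after a single glob call and tally the types of the newcomers. This is only a diagnostic sketch (object ids can be recycled, so treat the counts as approximate; the snapshot_types helper is just illustrative):

import gc
import glob

def snapshot_types():
    # Map each gc-tracked object's id to its type name.
    return dict((id(obj), type(obj).__name__) for obj in gc.get_objects())

before = snapshot_types()
glob.glob('*.py')   # stand-in for one iteration's pattern
after = snapshot_types()

# Tally the types that appeared between the two snapshots.
new_types = {}
for obj_id, type_name in after.iteritems():
    if obj_id not in before:
        new_types[type_name] = new_types.get(type_name, 0) + 1

print sorted(new_types.items(), key=lambda item: -item[1])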

2 Answers

只有影子陪我不离不弃 2024-08-27 03:40:22

I tracked this down to the fnmatch module. glob.glob calls fnmatch to actually perform the globbing, and fnmatch has a cache of regular expressions which is never cleared. So in this usage, the cache was growing continuously and unchecked. I've filed a bug against the fnmatch library [1].

[1]: http://bugs.python.org/issue7846
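
Until a fix lands, one possible workaround is to clear that cache by hand every so often. The sketch below relies on fnmatch's private module-level _cache dict, which is an implementation detail of the Python 2.x fnmatch module and may change or disappear in other versions:

import fnmatch
import glob
import sys

for count, line in enumerate(sys.stdin):
    glob_result = glob.glob(line.rstrip('\n'))
    if count % 1000 == 0:
        # Drop fnmatch's pattern -> compiled-regex cache; patterns seen
        # again later will simply be recompiled.
        fnmatch._cache.clear()

Clearing every 1000 iterations is arbitrary; the right interval depends on how many distinct patterns you feed in.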

不知所踪 2024-08-27 03:40:22

I cannot reproduce any actual leak on my system, but I think your "every 100th iteration, 100 objects are freed" is you hitting the cache of compiled regular expressions (via the glob module). If you peek at re.py, you'll see that _MAXCACHE defaults to 100, and by default the entire cache is cleared once that limit is hit (in _compile). If you call re.purge() before your gc calls, you will probably see that effect go away.

(Note that I'm only suggesting re.purge() here to check whether that cache is affecting your gc results; it should not be necessary in your actual code.)
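
For example, a minimal variant of the measurement loop with the purge added (assuming the same Python 2 setup as in the question):

import gc
import glob
import re
import sys

previous_num_objects = 0

for count, line in enumerate(sys.stdin):
    glob_result = glob.glob(line.rstrip('\n'))
    re.purge()   # empty re's compiled-pattern cache before counting
    current_num_objects = len(gc.get_objects())
    print "(%d) This: %d, New: %d" % (
        count, current_num_objects, current_num_objects - previous_num_objects)
    previous_num_objects = current_num_objects

If the sawtooth in the object counts disappears, the re cache was the source of the "freed every 100 iterations" pattern.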

I doubt that fixes your massive memory increase problem, though.
