Why does this Python loop leak memory?
I am writing a custom file system crawler, which gets passed millions of globs to process through sys.stdin. I'm finding that when running the script, its memory usage increases massively over time and the whole thing crawls practically to a halt. I've written a minimal case below which shows the problem. Am I doing something wrong, or have I found a bug in Python / the glob module? (I am using python 2.5.2).
#!/usr/bin/env python
import glob
import sys
import gc

previous_num_objects = 0
for count, line in enumerate(sys.stdin):
    glob_result = glob.glob(line.rstrip('\n'))
    current_num_objects = len(gc.get_objects())
    new_objects = current_num_objects - previous_num_objects
    print "(%d) This: %d, New: %d, Garbage: %d, Collection Counts: %s" \
        % (count, current_num_objects, new_objects, len(gc.garbage), gc.get_count())
    previous_num_objects = current_num_objects
The output looks like:
(0) This: 4042, New: 4042, Python Garbage: 0, Python Collection Counts: (660, 5, 0)
(1) This: 4061, New: 19, Python Garbage: 0, Python Collection Counts: (90, 6, 0)
(2) This: 4064, New: 3, Python Garbage: 0, Python Collection Counts: (127, 6, 0)
(3) This: 4067, New: 3, Python Garbage: 0, Python Collection Counts: (130, 6, 0)
(4) This: 4070, New: 3, Python Garbage: 0, Python Collection Counts: (133, 6, 0)
(5) This: 4073, New: 3, Python Garbage: 0, Python Collection Counts: (136, 6, 0)
(6) This: 4076, New: 3, Python Garbage: 0, Python Collection Counts: (139, 6, 0)
(7) This: 4079, New: 3, Python Garbage: 0, Python Collection Counts: (142, 6, 0)
(8) This: 4082, New: 3, Python Garbage: 0, Python Collection Counts: (145, 6, 0)
(9) This: 4085, New: 3, Python Garbage: 0, Python Collection Counts: (148, 6, 0)
Every 100th iteration, 100 objects are freed, so len(gc.get_objects()) increases by 200 every 100 iterations. len(gc.garbage) never changes from 0. The 2nd-generation collection count increases slowly, while the 0th- and 1st-generation counts go up and down.
2 Answers
I tracked this down to the fnmatch module. glob.glob calls fnmatch to actually perform the globbing, and fnmatch has a cache of regular expressions which is never cleared. So in this usage, the cache was growing continuously and unchecked. I've filed a bug against the fnmatch library [1].
[1]: http://bugs.python.org/issue7846 (Python bug tracker)
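Until a fix lands, a minimal stopgap sketch: in Python 2.5 the cache in question is the module-level fnmatch._cache dict, which is a private implementation detail rather than a public API, so this may break in other versions; clearing it periodically (the interval below is arbitrary) bounds the growth at the cost of re-compiling patterns.

#!/usr/bin/env python
# Stopgap sketch: periodically clear fnmatch's private pattern cache so it
# cannot grow without bound. fnmatch._cache is an implementation detail of
# Python 2.5's fnmatch module, not a supported API.
import fnmatch
import glob
import sys

for count, line in enumerate(sys.stdin):
    glob_result = glob.glob(line.rstrip('\n'))
    if count % 1000 == 999:  # arbitrary interval: every 1000 globs
        fnmatch._cache.clear()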
I cannot reproduce any actual leak on my system, but I think your "every 100th iteration, 100 objects are freed" is you hitting the cache for compiled regular expressions (via the glob module). If you peek at re.py you'll see that _MAXCACHE defaults to 100, and by default the entire cache is blown away once you hit that limit (in _compile). If you call re.purge() before your gc calls, you will probably see that effect go away. (Note that I'm only suggesting re.purge() here to check whether the cache is affecting your gc results; it should not be necessary in your actual code.) I doubt that fixes your massive memory increase problem, though.
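A minimal sketch of that diagnostic, applied to the loop from the question: re.purge() (available in Python 2.5's re module) empties re's compiled-pattern cache before each measurement, so if re's cache is what the object counts were picking up, the per-iteration "New" numbers should change.

#!/usr/bin/env python
# Diagnostic sketch: purge re's compiled-pattern cache before counting
# objects, so the cache's grow-and-flush cycle does not skew the numbers.
import glob
import sys
import gc
import re

previous_num_objects = 0
for count, line in enumerate(sys.stdin):
    glob_result = glob.glob(line.rstrip('\n'))
    re.purge()  # empty re's regex cache before measuring
    current_num_objects = len(gc.get_objects())
    new_objects = current_num_objects - previous_num_objects
    print "(%d) This: %d, New: %d" % (count, current_num_objects, new_objects)
    previous_num_objects = current_num_objects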