lxml parser eats up all the memory
I'm writing a spider in Python, using the lxml library to parse HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating memory, up to 8GB (all the server's memory). But I have only 100 async threads, and each of them parses documents of at most 300kb.
I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it in isolation.
The problem is in this line of code:
HTML = lxml.html.fromstring(htmltext)
Maybe someone knows what it could be, or how to fix it?
Thanks for your help.
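For reference, the setup described above roughly amounts to the following hypothetical minimal version (Python 2, matching the versions listed below; all names and URLs are placeholders, not code from the question):

    # Hypothetical reconstruction of the setup: 100 greenlets, each
    # fetching a page and parsing it with lxml.html.fromstring.
    from gevent import monkey; monkey.patch_all()
    from gevent.pool import Pool
    import urllib2
    import lxml.html

    pool = Pool(100)  # at most 100 concurrent greenlets

    def crawl(url):
        htmltext = urllib2.urlopen(url).read()  # documents up to ~300kb
        html = lxml.html.fromstring(htmltext)   # memory reportedly grows here
        return [str(href) for href in html.xpath("//a/@href")]

    seed_urls = ["http://example.com/"]  # placeholder seed list
    for url in seed_urls:
        pool.spawn(crawl, url)
    pool.join()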
P.S.
Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Update:
I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.
Now, after some time, they start writing this to the error log:
"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored"
I can't catch this exception, so it is written to the log over and over until there is no free space left on disk.
How can I catch this exception and kill the process, so the daemon can create a new one?
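One hedged sketch of a workaround (my own, not from the thread): the MemoryError is raised inside lxml's C-level error-log callback, where Python exception handling can't reach it, so besides a try/except around the parse call you can poll the process's resident memory and exit hard; the supervising daemon can then respawn the worker. MAX_RSS_KB is an assumed value here.

    import os
    import resource
    import lxml.html

    MAX_RSS_KB = 500 * 1024  # assumed per-worker ceiling, tune to your ulimit

    def parse_or_die(htmltext):
        try:
            return lxml.html.fromstring(htmltext)
        except MemoryError:
            os._exit(1)  # exit without cleanup handlers, which may need memory

    def check_rss():
        # ru_maxrss is reported in kilobytes on Linux
        if resource.getrusage(resource.RUSAGE_SELF).ru_maxrss > MAX_RSS_KB:
            os._exit(1)

Calling check_rss() between documents lets the worker die cleanly before the soft ulimit is hit and the recursive logging starts.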
3 Answers
You might be keeping some references which keep the documents alive. Be careful with string results from xpath evaluation, for example: by default they are "smart" strings, which provide access to the containing element, thus keeping the tree in memory if you keep a reference to them. See the docs on xpath return values.
(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
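To illustrate (my own sketch, not code from this answer): a smart string's getparent() keeps the whole tree reachable, while a str() copy and the smart_strings=False option of etree.XPath do not:

    import lxml.html
    from lxml import etree

    doc = lxml.html.fromstring("<html><body><p>hello</p></body></html>")

    smart = doc.xpath("//p/text()")[0]
    print(smart.getparent().tag)  # 'p' -- the string still references the tree

    plain = str(smart)            # a plain copy drops the back-reference

    # Or disable smart strings for the whole evaluation:
    get_text = etree.XPath("//p/text()", smart_strings=False)
    plain_list = get_text(doc)    # plain strings, no element references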
There is an excellent article at http://www.lshift.net/blog/2008/11/14/tracing-python-memory-leaks which demonstrates graphical debugging of memory structures; this might help you figure out what's not being released and why.
Edit: I found the article from which I got that link - Python memory leaks
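As a stdlib-only starting point (my own sketch, not necessarily the article's technique), you can diff live-object counts by type across one crawl cycle with the gc module; whatever keeps growing is a leak candidate:

    import gc
    from collections import Counter

    def type_counts():
        # Count every object the garbage collector is tracking, by type name.
        return Counter(type(o).__name__ for o in gc.get_objects())

    before = type_counts()
    # ... run one fetch/parse cycle here ...
    after = type_counts()

    # Counter subtraction keeps only positive deltas: types that grew.
    for name, delta in (after - before).most_common(10):
        print(name, delta)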
It seems the issue stems from the library lxml relies on: libxml2, which is written in C.
Here is the first report: http://codespeak.net/pipermail/lxml-dev/2010-December/005784.html
This bug hasn't been mentioned in either the lxml v2.3 bug fix logs or the libxml2 change logs.
Oh, there are follow-up mails here: https://bugs.launchpad.net/lxml/+bug/728924
Well, I tried to reproduce the issue, but got nothing abnormal. Those who can reproduce it may help clarify the problem.