lxml parser eats all memory

Posted on 2024-10-21 03:13:50

I'm writing some spiders in Python, using the lxml library to parse HTML and the gevent library for async. I found that after running for a while, the lxml parser starts eating memory, up to 8GB (all of the server's memory). But I only have 100 async threads, and each of them parses documents of at most 300kb.

I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it.

The problem is in this line of code:

HTML = lxml.html.fromstring(htmltext)

Maybe someone knows what it could be, or how to fix it?

Thanks for your help.

P.S.

Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64    GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Update:

I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.

Now, after some time, they start writing this to the error log:

"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored".

I can't catch this exception, so the message is written to the log over and over until the disk runs out of free space.

How can I catch this exception and kill the process so that the daemon can create a new one?
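
An "Exception ... ignored" message is printed by the interpreter when an exception is raised somewhere it cannot propagate from, so a surrounding try/except never sees it. One possible workaround, sketched here under assumptions rather than taken from the thread, is to have the worker check its own memory use and exit before the limit is reached, so the supervising daemon restarts it. The function names, the exit code and the idea of calling the check after each parsed document are all hypothetical; the sketch uses only the standard library `resource` module (Unix only) and is written for Python 3.

```python
import resource
import sys


def peak_rss():
    """Peak resident set size of this process, in the platform's
    ru_maxrss units (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def over_limit(limit):
    """True if the process has ever used more than `limit` units."""
    return peak_rss() > limit


def exit_if_over(limit, exit_code=3):
    """Terminate the worker so an external supervisor can start a new one.

    Hypothetical helper: the exit code is arbitrary.
    """
    if over_limit(limit):
        sys.exit(exit_code)
```

A worker could call something like `exit_if_over(450000)` after each document. Peak RSS never goes down, which is acceptable here because the only action taken is a restart.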

Comments (3)

我的鱼塘能养鲲 2024-10-28 03:13:50

You might be keeping some references that keep the documents alive. Be careful with string results from XPath evaluation, for example: by default they are "smart" strings, which provide access to the containing element, so holding a reference to one of them keeps the whole tree in memory. See the docs on XPath return values:

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.

(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
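
A minimal sketch of the difference, assuming Python 3 and a current lxml rather than the versions listed in the question. The sample markup and variable names are made up for illustration; `smart_strings` is the keyword argument named in the quoted documentation.

```python
import lxml.html
from lxml import etree

# Made-up sample document for illustration.
htmltext = "<html><body><a href='/x'>first</a><a href='/y'>second</a></body></html>"
tree = lxml.html.fromstring(htmltext)

# Default behaviour: "smart" strings know their parent element, so keeping
# one of them around keeps the whole tree reachable.
smart = tree.xpath("//a/text()")
smart_has_parent = hasattr(smart[0], "getparent")

# With smart_strings=False the results are plain strings with no link
# back to the tree.
plain = tree.xpath("//a/text()", smart_strings=False)
plain_has_parent = hasattr(plain[0], "getparent")

# A precompiled XPath expression accepts the same keyword.
find_links = etree.XPath("//a/@href", smart_strings=False)
hrefs = find_links(tree)
```

If the spider stores extracted text or attribute values in long-lived structures (queues, caches, result lists), using plain strings lets each parsed tree be freed once the parse function returns.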

赢得她心 2024-10-28 03:13:50

There is an excellent article at http://www.lshift.net/blog/2008/11/14/tracing-python-memory-leaks which demonstrates graphical debugging of memory structures; this might help you figure out what's not being released and why.

Edit: I found the article from which I got that link - Python memory leaks
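
As a rough first step before any graphical debugging, one can compare counts of live objects by type between two points in the program. The sketch below is not the method from the linked article; it uses only the standard library `gc` module, and `Leaky` is a hypothetical stand-in for whatever the spider is accidentally holding on to. It only sees Python objects tracked by the garbage collector, not memory allocated inside C libraries.

```python
import gc
from collections import Counter


def count_live_objects():
    """Return a Counter mapping type name -> number of gc-tracked instances."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, top=10):
    """List the (type name, increase) pairs that grew most between snapshots."""
    delta = after - before  # Counter subtraction keeps only positive counts
    return delta.most_common(top)


class Leaky(object):
    """Hypothetical stand-in for an object that is never released."""


before = count_live_objects()
kept = [Leaky() for _ in range(1000)]  # simulate references that pile up
after = count_live_objects()
report = growth(before, after)
```

Taking a snapshot every few hundred parsed documents and printing `growth(...)` shows which types keep accumulating, which narrows down where to look for the lingering references.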

十秒萌定你 2024-10-28 03:13:50

It seems the issue stems from the library lxml relies on: libxml2, which is written in C.
Here is the first report: http://codespeak.net/pipermail/lxml-dev/2010-December/005784.html
This bug isn't mentioned in either the lxml v2.3 bug fix log or the libxml2 change logs.

Oh, there are follow-up mails here: https://bugs.launchpad.net/lxml/+bug/728924

Well, I tried to reproduce the issue but saw nothing abnormal. People who can reproduce it may be able to help clarify the problem.
