lxml parser eats up all the memory
I'm writing a spider in Python, using the lxml library to parse HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating memory, up to 8GB (all the server's memory). But I have only 100 async threads, and each of them parses documents of at most 300kb.
I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it in isolation.
The problem is in this line of code:
HTML = lxml.html.fromstring(htmltext)
Maybe someone knows what it could be, or how to fix it?
Thanks for your help.
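For reference, the setup described above roughly amounts to the following hypothetical minimal version (Python 2, matching the versions listed below; all names and URLs are placeholders, not code from the question):

    # Hypothetical reconstruction of the setup: 100 greenlets, each
    # fetching a page and parsing it with lxml.html.fromstring.
    from gevent import monkey; monkey.patch_all()
    from gevent.pool import Pool
    import urllib2
    import lxml.html

    pool = Pool(100)  # at most 100 concurrent greenlets

    def crawl(url):
        htmltext = urllib2.urlopen(url).read()  # documents up to ~300kb
        html = lxml.html.fromstring(htmltext)   # memory reportedly grows here
        return [str(href) for href in html.xpath("//a/@href")]

    seed_urls = ["http://example.com/"]  # placeholder seed list
    for url in seed_urls:
        pool.spawn(crawl, url)
    pool.join()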
P.S.
Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Update:
I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.
Now, after some time, they start writing this to the error log:
"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored"
I can't catch this exception, so it is written to the log over and over until there is no free space left on disk.
How can I catch this exception and kill the process, so the daemon can create a new one?
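One hedged sketch of a workaround (my own, not from the thread): the MemoryError is raised inside lxml's C-level error-log callback, where Python exception handling can't reach it, so besides a try/except around the parse call you can poll the process's resident memory and exit hard; the supervising daemon can then respawn the worker. MAX_RSS_KB is an assumed value here.

    import os
    import resource
    import lxml.html

    MAX_RSS_KB = 500 * 1024  # assumed per-worker ceiling, tune to your ulimit

    def parse_or_die(htmltext):
        try:
            return lxml.html.fromstring(htmltext)
        except MemoryError:
            os._exit(1)  # exit without cleanup handlers, which may need memory

    def check_rss():
        # ru_maxrss is reported in kilobytes on Linux
        if resource.getrusage(resource.RUSAGE_SELF).ru_maxrss > MAX_RSS_KB:
            os._exit(1)

Calling check_rss() between documents lets the worker die cleanly before the soft ulimit is hit and the recursive logging starts.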
3 Answers
You might be keeping some references which keep the documents alive. Be careful with string results from xpath evaluation, for example: by default they are "smart" strings, which provide access to the containing element, thus keeping the tree in memory if you keep a reference to them. See the docs on xpath return values.
(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
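To illustrate (my own sketch, not code from this answer): a smart string's getparent() keeps the whole tree reachable, while a str() copy and the smart_strings=False option of etree.XPath do not:

    import lxml.html
    from lxml import etree

    doc = lxml.html.fromstring("<html><body><p>hello</p></body></html>")

    smart = doc.xpath("//p/text()")[0]
    print(smart.getparent().tag)  # 'p' -- the string still references the tree

    plain = str(smart)            # a plain copy drops the back-reference

    # Or disable smart strings for the whole evaluation:
    get_text = etree.XPath("//p/text()", smart_strings=False)
    plain_list = get_text(doc)    # plain strings, no element references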
There is an excellent article at http://www.lshift.net/blog/2008/11/14/tracing-python-memory-leaks which demonstrates graphical debugging of memory structures; this might help you figure out what's not being released and why.
Edit: I found the article from which I got that link - Python memory leaks
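As a stdlib-only starting point (my own sketch, not necessarily the article's technique), you can diff live-object counts by type across one crawl cycle with the gc module; whatever keeps growing is a leak candidate:

    import gc
    from collections import Counter

    def type_counts():
        # Count every object the garbage collector is tracking, by type name.
        return Counter(type(o).__name__ for o in gc.get_objects())

    before = type_counts()
    # ... run one fetch/parse cycle here ...
    after = type_counts()

    # Counter subtraction keeps only positive deltas: types that grew.
    for name, delta in (after - before).most_common(10):
        print(name, delta)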
It seems the issue stems from the library lxml relies on: libxml2, which is written in C.
Here is the first report: http://codespeak.net/pipermail/lxml-dev/2010-December/005784.html
This bug hasn't been mentioned in either the lxml v2.3 bug fix logs or the libxml2 change logs.
Oh, there are follow-up mails here: https://bugs.launchpad.net/lxml/+bug/728924
Well, I tried to reproduce the issue, but got nothing abnormal. Those who can reproduce it may help clarify the problem.