Using lxml, what causes the error "lxml.etree.XMLSyntaxError: Document is empty"?

Posted on 2024-10-11 03:41:52

I'm using mechanize/cookiejar/lxml to read a page, and it works for some pages but not for others. The error I get on the failing pages is the one in the title. I can't post the pages here because they aren't SFW, but is there a way to fix it? Basically, this is what I do:

import mechanize, cookielib
from lxml import etree    

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(False)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]

response = br.open('...')
tree = etree.parse(response) #error

After that I get the root and search the document for the values I want. Apparently iterparse doesn't crash on these pages, but for now I assume that's only because I haven't processed anything with it yet. Also, I haven't figured out how to search for things with it.
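On the question of searching with iterparse, here is a minimal sketch. It assumes a small made-up document (the `items`/`item` names are hypothetical, not taken from the pages in question) and uses the `tag` filter of `etree.iterparse` to collect matching elements as they are parsed. Note that iterparse works incrementally, so a syntax error would normally surface only once the loop actually runs, which is consistent with the suspicion above.

```python
import io
from lxml import etree

# Hypothetical sample document; the element names are made up for illustration.
SAMPLE = b"<items><item id='1'>first</item><item id='2'>second</item></items>"


def find_texts(source, tag):
    """Collect the text of every element named `tag`, streaming with iterparse."""
    texts = []
    # Only "end" events for elements whose tag matches are yielded.
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        texts.append(elem.text)
        elem.clear()  # release the element's content once it has been read
    return texts


item_texts = find_texts(io.BytesIO(SAMPLE), "item")
print(item_texts)
```

For documents that fit in memory, parsing the whole tree and calling `find`/`xpath` on the root is usually simpler; iterparse mainly pays off for very large inputs.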

I've tried disabling gzip and enabling the referer header, but neither solves the problem. I also tried saving the source code to disk and building the tree from that file, and I get the same error.

Edit
The response I get seems to be fine: using print repr(response) as suggested, I get a <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method and confirm that the saved .xml file opens correctly in the browser.

Also, in one of the pages there is a &rsquo; that gives me the following error: "lxml.etree.XMLSyntaxError: Entity 'rsquo' not defined, line 17, column 7054". So far I've been replacing it with a regex, but is there a parser that can handle this? I get this error even with the lxml.html.parse suggested below.
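Some background on the entity error: XML predefines only five named entities (amp, lt, gt, quot, apos), so &rsquo; is rejected by an XML parser unless a DTD declares it, whereas it is a standard HTML entity. The sketch below uses a made-up one-line snippet and contrasts `etree.fromstring` with `lxml.html.fromstring`. It illustrates the general behaviour only and may not reproduce the exact situation on the asker's pages.

```python
from lxml import etree, html

# Hypothetical snippet containing an HTML named entity.
SNIPPET = "<p>It&rsquo;s here</p>"

# The XML parser has no definition for &rsquo; and raises XMLSyntaxError.
try:
    etree.fromstring(SNIPPET)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = str(exc)

# The HTML parser knows the HTML entity set and decodes it to U+2019.
paragraph = html.fromstring(SNIPPET)
text = paragraph.text_content()

print(xml_error)
print(repr(text))
```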

Regarding the file being highlighted, I meant that when I open it with gEdit it looks roughly like this: http://img34.imageshack.us/img34/9574/gedit.jpg

Comments (2)

谈情不如逗狗 2024-10-18 03:41:52

Use lxml.html.parse for HTML; it can handle even very broken HTML. Do you still get an error then?
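As a rough illustration of that suggestion, the sketch below feeds deliberately malformed markup (a made-up example with unclosed tags) to `lxml.html.parse` through a file-like object and then queries the repaired tree with XPath.

```python
import io
from lxml import html

# Hypothetical broken markup: unclosed <p>, <div> and <a>, missing </html>.
BROKEN = b"<html><body><p>unclosed paragraph<div><a href='/x'>link</body>"

# html.parse accepts a filename, URL, or file-like object and returns an ElementTree.
tree = html.parse(io.BytesIO(BROKEN))
root = tree.getroot()

# The parser repairs the structure, so normal queries work on the result.
links = root.xpath("//a/@href")
print(root.tag, links)
```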

一腔孤↑勇 2024-10-18 03:41:52

What kind of object is response? According to the help, etree.parse expects one of:

   - a file name/path
   - a file object
   - a file-like object
   - a URL using the HTTP or FTP protocol
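One way a file-like object can produce this kind of error, offered as a possibility to check rather than a diagnosis of the pages in question: if the stream has already been read to the end, the parser receives no data. The sketch below uses an in-memory `io.BytesIO` stream as a stand-in for the response object and shows that rewinding with `seek(0)` makes the same stream parseable again.

```python
import io
from lxml import etree

# Stand-in for a response object; the XML content is made up for illustration.
stream = io.BytesIO(b"<root><value>42</value></root>")
stream.read()  # simulate an earlier read() that consumed the whole stream

# Parsing the exhausted stream fails because no bytes are left.
try:
    etree.parse(stream)
    error = None
except etree.XMLSyntaxError as exc:
    error = exc

# Rewinding the stream lets the parser see the document again.
stream.seek(0)
tree = etree.parse(stream)
value = tree.findtext("value")

print(type(error).__name__, value)
```

If the response object does not support seeking, reading the body once into a bytes object and parsing that with `etree.fromstring` is an alternative.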