Using lxml, what causes the error "lxml.etree.XMLSyntaxError: Document is empty"?

Posted on 2024-10-11 03:41:52

I'm using mechanize, cookielib, and lxml to read pages. It works for some pages but not for others, and on the failing ones I get the error in the title. I can't post the pages here because they aren't SFW, but is there a way to fix it? Basically, this is what I do:

import mechanize, cookielib
from lxml import etree    

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(False)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]

response = br.open('...')
tree = etree.parse(response)  # raises XMLSyntaxError: Document is empty

After that I get the root and search the document for the values I want. Apparently iterparse doesn't raise the error, but I assume that's only because I haven't processed anything with it yet. I also haven't figured out how to search for values with it.
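On searching with iterparse: a minimal sketch of the usual pattern, which is to react to "end" events and test each element's tag. It uses the standard library's xml.etree.ElementTree so that it runs anywhere; lxml.etree.iterparse accepts the same source and events arguments for this basic use. The helper name find_texts and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def find_texts(source, tag):
    """Stream through `source` and collect the text of every element named `tag`."""
    found = []
    # An "end" event fires once an element has been read completely,
    # so its text is available at that point.
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == tag:
            found.append(element.text)
            element.clear()  # release content that is no longer needed
    return found


# Hypothetical sample document standing in for the real page.
sample = io.BytesIO(
    b"<feed><entry><title>a</title></entry>"
    b"<entry><title>b</title></entry></feed>"
)
titles = find_texts(sample, "title")
print(titles)
```

Because iterparse only reports elements as it reaches them, a malformed or empty document may not fail until the loop actually runs, which would be consistent with the error not showing up when nothing is processed.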

I've tried disabling gzip and enabling the referer header, but neither solves the problem. I also tried saving the source code to disk and building the tree from that file, and I get the same error.

Edit
The response I get seems to be fine. Using print repr(response) as suggested, I get a <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method, and the saved .xml file opens correctly in the browser.
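One thing worth ruling out, assuming the response behaves like an ordinary file-like object: once read() has consumed the stream, a second read returns nothing, and a parser handed that exhausted stream sees an empty document. Whether this applies to mechanize's response_seek_wrapper is not confirmed here. A minimal sketch, using io.BytesIO as a stand-in for the response (the helper name read_body and the sample markup are hypothetical):

```python
import io


def read_body(response):
    """Read a file-like response once and fail loudly if the body is empty."""
    data = response.read()
    if not data.strip():
        raise ValueError("response body is empty; nothing to parse")
    return data


# Stand-in for the object returned by br.open(...); content is hypothetical.
response = io.BytesIO(b"<html><body><p>hello</p></body></html>")

body = read_body(response)
leftover = response.read()   # the stream is already consumed at this point
source = io.BytesIO(body)    # fresh file-like object to hand to a parser

print(len(body), repr(leftover))
```

Keeping the bytes in a variable also makes it easy to inspect the first few hundred characters and confirm that the page is XML rather than HTML before choosing a parser.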

Also, one of the pages contains a &rsquo; entity, which gives me the following error: "lxml.etree.XMLSyntaxError: Entity 'rsquo' not defined, line 17, column 7054". So far I've been replacing it with a regex, but is there a parser that can handle this? I get this error even with the lxml.html.parse suggested below.
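For context, XML itself predefines only five entities (amp, lt, gt, quot, apos), so a strict XML parser rejects HTML names such as rsquo unless a DTD declares them. A generalised form of the regex workaround, as a sketch using only the standard library: rewrite every known HTML named entity as a numeric character reference, which any XML parser accepts. The function name html_entities_to_numeric and the sample string are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML already defines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(text):
    """Replace HTML named entities (e.g. &rsquo;) with numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML or unknown entities as they are
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(replace, text)


sample = "<p>It&rsquo;s fine &amp; well</p>"
fixed = html_entities_to_numeric(sample)
root = ET.fromstring(fixed)
print(fixed)
```

This works on decoded text, so a raw byte response would need decoding with the page's encoding first. It does not repair other well-formedness problems such as unclosed tags.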

Regarding the file being highlighted: I meant that when I open it with gEdit, it looks roughly like this: http://img34.imageshack.us/img34/9574/gedit.jpg

谈情不如逗狗 2024-10-18 03:41:52

For HTML, use lxml.html.parse; it can handle even very broken HTML. Do you still get the error with that?
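A small sketch of that suggestion, assuming lxml is installed. The markup below is a made-up fragment with unclosed tags and an HTML entity, and io.BytesIO stands in for the mechanize response.

```python
import io

from lxml import html

# Hypothetical broken page: unclosed <p>, missing </body> and </html>.
broken = b"<html><body><p>It&rsquo;s fine<p>next"

tree = html.parse(io.BytesIO(broken))  # accepts a file-like object
root = tree.getroot()

# Collect the text of each paragraph the parser recovered.
paragraphs = [p.text_content() for p in root.iter("p")]
print(root.tag, paragraphs)
```

The HTML parser recovers from the missing closing tags and resolves HTML entity names, which the strict XML parser behind etree.parse does not do by default.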


一腔孤↑勇 2024-10-18 03:41:52

What is the nature of response? According to the help, etree.parse is expecting one of:

   - a file name/path
   - a file object
   - a file-like object
   - a URL using the HTTP or FTP protocol
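To illustrate two of the source types listed above, here is a sketch that parses the same bytes once from a file-like object and once from a path on disk, assuming lxml is installed. The sample XML and variable names are hypothetical.

```python
import io
import os
import tempfile

from lxml import etree

xml_bytes = b"<root><item>1</item><item>2</item></root>"

# 1. A file-like object: anything with a read() method.
tree_from_stream = etree.parse(io.BytesIO(xml_bytes))

# 2. A file name/path: write the bytes to a temporary file first.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
    handle.write(xml_bytes)
    path = handle.name
try:
    tree_from_path = etree.parse(path)
finally:
    os.remove(path)

items_stream = [el.text for el in tree_from_stream.getroot().iter("item")]
items_path = [el.text for el in tree_from_path.getroot().iter("item")]
print(items_stream, items_path)
```

If both forms fail in the same way on the real page, as the question reports, the problem is more likely in the content itself than in how the response object is passed to the parser.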

Author: 紫﹏色ふ单纯