lxml can't parse?
I want to parse the tables in an HTML page, but I found that lxml can't parse them. What's wrong?
# -*- coding: utf8 -*-
import urllib
import lxml.etree

keyword = 'lxml+tutorial'
url = 'http://www.baidu.com/s?wd='

if __name__ == '__main__':
    page = 0
    link = url + keyword + '&pn=' + str(page)
    # fetch the search results page (Python 2 urllib)
    f = urllib.urlopen(link)
    content = f.read()
    f.close()
    # parse the HTML and look for table elements
    tree = lxml.etree.HTML(content)
    query_link = '//table'
    info_link = tree.xpath(query_link)
    print info_link
The printed result is just []...
lxml's documentation says, "The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing."
And sure enough, the HTML returned by Baidu is invalid: the W3C validator reports "173 Errors, 7 warnings". I don't know (and haven't investigated) whether these particular errors have caused your trouble with lxml, because I think that your strategy of using lxml to parse HTML found "in the wild" (which is nearly always invalid) is doomed.
For parsing invalid HTML, you need a parser that implements the (surprisingly bizarre!) HTML error recovery algorithm. So I recommend swapping lxml for html5lib, which handles Baidu's invalid HTML with no problems:
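Something along these lines, as a minimal sketch (assuming Python 2 and the same Baidu URL and keyword as the question; this is an illustration, not the original answer's code):

# Sketch: fetch the same page as the question, but parse it with html5lib,
# which implements the HTML5 error-recovery algorithm and so copes with
# Baidu's broken markup.
import urllib
import html5lib

if __name__ == '__main__':
    f = urllib.urlopen('http://www.baidu.com/s?wd=lxml+tutorial&pn=0')
    # namespaceHTMLElements=False keeps tag names plain ('table') instead of
    # '{http://www.w3.org/1999/xhtml}table', so ElementTree-style paths work.
    tree = html5lib.parse(f, namespaceHTMLElements=False)
    f.close()
    print tree.findall('.//table')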
I see several places that code could be improved but, for your question, here are my suggestions:

1. Use lxml.html.parse(link) rather than lxml.etree.HTML(content), so all the "just works" automatics can kick in (e.g. handling character-encoding declarations in headers properly).
2. Try using tree.findall(".//table") rather than tree.xpath("//table"). I'm not sure whether it'll make a difference, but I just used that syntax in a project of my own a few hours ago without issue and, as a bonus, it's compatible with non-LXML ElementTree APIs.

The other major thing I'd suggest would be using Python's built-in functions for building URLs, so you can be sure the URL you're building is valid and properly escaped in all circumstances. A sketch combining these suggestions follows below.
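A minimal sketch combining the suggestions above (assuming Python 2 to match the question; the 'wd' and 'pn' query parameters are simply the ones from the question's own URL, not anything documented by Baidu):

# Sketch only: lxml.html.parse() fetches and parses the URL itself and honours
# character-encoding declarations; urlencode() builds a properly escaped query
# string; findall(".//table") uses the ElementTree-compatible search syntax.
import urllib
import lxml.html

if __name__ == '__main__':
    query = urllib.urlencode([('wd', 'lxml tutorial'), ('pn', 0)])
    link = 'http://www.baidu.com/s?' + query
    tree = lxml.html.parse(link)   # returns an ElementTree, not an Element
    tables = tree.getroot().findall('.//table')
    print tables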
If LXML can't find a table and the browser shows a table to exist, I can only imagine it's one of these three problems:
... lxml.etree.HTML when called directly.