lxml can't parse it?

Posted on 2024-09-30 21:56:44


I want to parse the tables in an HTML page, but I found that lxml can't parse them. What's wrong?

# -*- coding: utf8 -*-
import urllib
import lxml.etree

keyword = 'lxml+tutorial'
url = 'http://www.baidu.com/s?wd='

if __name__ == '__main__':
    page = 0

    # Build the search-results URL for the keyword and page number.
    link = url + keyword + '&pn=' + str(page)

    # Fetch the page (Python 2 urllib).
    f = urllib.urlopen(link)
    content = f.read()
    f.close()

    # Parse the HTML and query for every <table> element.
    tree = lxml.etree.HTML(content)

    query_link = '//table'

    info_link = tree.xpath(query_link)

    print info_link

The printed result is just []...


Comments (2)

若水般的淡然安静女子 2024-10-07 21:56:44


lxml's documentation says, "The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing."

And sure enough, the HTML returned by Baidu is invalid: the W3C validator reports "173 Errors, 7 warnings". I don't know (and haven't investigated) whether these particular errors have caused your trouble with lxml, because I think that your strategy of using lxml to parse HTML found "in the wild" (which is nearly always invalid) is doomed.

For parsing invalid HTML, you need a parser that implements the (surprisingly bizarre!) HTML error recovery algorithm. So I recommend swapping lxml for html5lib, which handles Baidu's invalid HTML with no problems:

>>> import urllib
>>> from html5lib import html5parser, treebuilders
>>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
>>> dom = p.parse(urllib.urlopen('http://www.baidu.com/s?wd=foo').read())
>>> len(dom.getElementsByTagName('table'))
12

软糖 2024-10-07 21:56:44


I see several places where the code could be improved, but for your question, here are my suggestions:

  1. Use lxml.html.parse(link) rather than lxml.etree.HTML(content) so all the "just works" automatics can kick in (e.g. properly handling character-encoding declarations in headers).

  2. Try using tree.findall(".//table") rather than tree.xpath("//table"). I'm not sure whether it'll make a difference, but I used that syntax in a project of my own a few hours ago without issue, and, as a bonus, it's compatible with non-lxml ElementTree APIs. (A combined sketch of both suggestions follows this list.)
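
A minimal sketch combining both suggestions, assuming Python 2 to match the question's code and reusing the question's Baidu URL:

import lxml.html

link = 'http://www.baidu.com/s?wd=lxml+tutorial&pn=0'

# lxml.html.parse() accepts a URL directly and handles fetching and
# character-encoding detection itself.
tree = lxml.html.parse(link)

# ElementTree-style query: find every <table> anywhere in the document.
tables = tree.findall('.//table')
print len(tables)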

The other major thing I'd suggest would be using Python's built-in functions for building URLs so you can be sure the URL you're building is valid and properly escaped in all circumstances.
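
For example, a sketch using urllib.urlencode from Python 2's standard library to build the same query string with correct escaping:

import urllib

# A list of pairs keeps the parameters in a predictable order.
params = urllib.urlencode([('wd', 'lxml tutorial'), ('pn', 0)])
link = 'http://www.baidu.com/s?' + params
print link   # http://www.baidu.com/s?wd=lxml+tutorial&pn=0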

If lxml can't find a table and the browser shows that a table exists, I can only imagine it's one of these three problems (a quick diagnostic sketch follows the list):

  1. Bad request. lxml gets a page without a table in it (e.g. a 404 or 500 error).
  2. Bad parsing. Something about the page confused lxml.etree.HTML when called directly.
  3. JavaScript needed. Maybe the table is generated client-side.
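
A rough way to tell these cases apart, sticking with the question's Python 2 urllib (the URL is carried over from the question):

import urllib

f = urllib.urlopen('http://www.baidu.com/s?wd=lxml+tutorial&pn=0')
content = f.read()
f.close()

# Case 1: anything other than 200 means the request itself went wrong.
print f.getcode()

# Cases 2 and 3: if no <table> appears even in the raw bytes, the table
# is most likely generated client-side by JavaScript; if it does appear
# but lxml returns [], the parser is choking on the markup.
print '<table' in content.lower()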