html5lib/lxml examples for BeautifulSoup users?

Posted 2024-09-19 01:10:15, word count 987, 15 views, 0 comments

I'm trying to wean myself off BeautifulSoup, which I love but which seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't figure out how to use the "find" and "findall" methods.

By looking at the docs for html5lib, I came up with this for a test program:

import cStringIO

f = cStringIO.StringIO()
f.write("""
  <html>
    <body>
      <table>
       <tr>
          <td>one</td>
          <td>1</td>
       </tr>
       <tr>
          <td>two</td>
          <td>2</td
       </tr>
      </table>
    </body>
  </html>
  """)
f.seek(0)

import html5lib
from html5lib import treebuilders
from lxml import etree  # why?

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
etree_document = parser.parse(f)

root = etree_document.getroot()

root.find(".//tr")

But this returns None. I noticed that if I call etree.tostring(root) I get all my data back, but every tag is prefixed with html (e.g. <html:table>). However, root.find(".//html:tr") throws a KeyError.

Can someone put me back on the right track?

Comments (5)

把人绕傻吧 2024-09-26 01:10:15

You can turn off namespaces with this command:
etree_document = html5lib.parse(t, treebuilder="lxml", namespaceHTMLElements=False)
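
To see what that flag changes, here is a minimal sketch, assuming both html5lib and lxml are installed. The markup string and the variable names are made up for illustration; only html5lib.parse and its treebuilder and namespaceHTMLElements arguments come from the answer above.

```python
import html5lib

# Hypothetical sample markup, similar to the table in the question.
markup = (
    "<table>"
    "<tr><td>one</td><td>1</td></tr>"
    "<tr><td>two</td><td>2</td></tr>"
    "</table>"
)

# Default behaviour: elements are placed in the XHTML namespace,
# so a bare "tr" in the path matches nothing.
namespaced_doc = html5lib.parse(markup, treebuilder="lxml")
namespaced_rows = namespaced_doc.getroot().findall(".//tr")

# With namespaces turned off, plain tag names work in find()/findall().
plain_doc = html5lib.parse(markup, treebuilder="lxml",
                           namespaceHTMLElements=False)
rows = plain_doc.getroot().findall(".//tr")

print(len(namespaced_rows), len(rows))
for row in rows:
    print([td.text for td in row.findall("td")])
```

The first search comes back empty for the same reason root.find(".//tr") returned None in the question; the second finds both rows.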

追风人 2024-09-26 01:10:15

In general, use lxml.html for HTML. Then you don't need to worry about building your own parser or about namespaces.

>>> import lxml.html as l
>>> doc = """
...    <html><body>
...    <table>
...      <tr>
...        <td>one</td>
...        <td>1</td>
...      </tr>
...      <tr>
...        <td>two</td>
...        <td>2</td
...      </tr>
...    </table>
...    </body></html>"""
>>> doc = l.document_fromstring(doc)
>>> doc.findall('.//tr')  # doctest: +ELLIPSIS
[<Element tr at ...>, <Element tr at ...>]

FYI, lxml.html also lets you use CSS selectors, which I find to be an easier syntax (in recent lxml versions this needs the separate cssselect package installed).

>>> doc.cssselect('tr')  # doctest: +ELLIPSIS
[<Element tr at ...>, <Element tr at ...>]

沉溺在你眼里的海 2024-09-26 01:10:15

It appears that using the "lxml" html5lib TreeBuilder causes html5lib to build the tree in the XHTML namespace -- which makes sense, as lxml is an XML library, and XHTML is how one represents HTML as XML. You can use lxml's qname syntax with the find() method to do something like:

root.find('.//{http://www.w3.org/1999/xhtml}tr')

Or you can use lxml's full XPath functions to do something like:

root.xpath('.//html:tr', namespaces={'html': 'http://www.w3.org/1999/xhtml'})

The lxml documentation has more information on how it uses XML namespaces.
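
Both lookup styles can be tried without installing anything. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made for portability; the find()/findall() path syntax shown here is shared by both), and the XHTML string is a made-up substitute for the tree html5lib would build.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for the html5lib output: every element
# lives in the XHTML namespace because of the default xmlns.
source = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    "<tr><td>one</td><td>1</td></tr>"
    "<tr><td>two</td><td>2</td></tr>"
    "</table></body></html>"
)
root = ET.fromstring(source)

# A bare tag name does not match a namespaced element.
plain = root.findall(".//tr")

# Qualified name: the namespace URI in braces, then the tag.
clark = root.findall(".//{%s}tr" % XHTML_NS)

# Prefix form: the prefix only works with an explicit prefix-to-URI map.
prefixed = root.findall(".//html:tr", namespaces={"html": XHTML_NS})

print(root.tag)
print(len(plain), len(clark), len(prefixed))
print([row.find("{%s}td" % XHTML_NS).text for row in clark])
```

The prefix in the last lookup is just a local alias; what actually identifies the element is the namespace URI.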

深巷少女 2024-09-26 01:10:15

I realize this is an old question, but I came here looking for information I couldn't find in any one place. I was trying to scrape a page with BeautifulSoup, but it was choking on some messy HTML. The default HTML parser is apparently less forgiving than some of the others available. One often-preferred parser is lxml, which I believe parses pages the way browsers do. BeautifulSoup lets you specify lxml as the parser, but using it takes a little work.

First, you need html5lib, and you must also install lxml. html5lib can use lxml (and some other libraries), but the two are not packaged together. [For Windows users: although I dislike fussing with Windows dependencies, to the point that I usually get libraries by copying them into my project directory, I strongly recommend using pip for this. It is pretty painless, though I think you need administrator access.]

Then you need to write something like this:

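# Note: this is Python 2 code (urllib2). The sanitizer tokenizer used below
# comes from older html5lib releases; newer releases expose sanitizing as a
# filter instead, so this may need adapting.
import urllib2
from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
from lxml import etree

url = 'http://...'

# Fetch the page and parse it with html5lib into an lxml tree,
# with namespaces turned off so tags keep their plain names.
content = urllib2.urlopen(url)
parser = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                             tree=treebuilders.getTreeBuilder("lxml"),
                             namespaceHTMLElements=False)
htmlData = parser.parse(content)

# Serialize the cleaned-up tree and hand it to BeautifulSoup.
htmlStr = etree.tostring(htmlData)

soup = BeautifulSoup(htmlStr, "lxml")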

Then enjoy your beautiful soup!

Note the namespaceHTMLElements=False option on the parser. This is important because lxml is intended for XML as opposed to just HTML. Because of that, all the tags in the resulting tree are labeled as belonging to the HTML namespace. The tags will look like (for example)

<html:li>

and BeautifulSoup will not work well with them.
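
As a possible shortcut, and not the method described above: Beautiful Soup 4 also accepts "html5lib" as a parser name, which skips the intermediate lxml tree entirely. A minimal sketch, assuming bs4 and html5lib are both installed; the markup string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
markup = (
    "<table>"
    "<tr><td>one</td><td>1</td></tr>"
    "<tr><td>two</td><td>2</td></tr>"
    "</table>"
)

# "html5lib" asks Beautiful Soup to build its own tree with html5lib;
# bs4 raises an error here if html5lib is not installed.
soup = BeautifulSoup(markup, "html5lib")

rows = soup.find_all("tr")
cells = [[td.get_text() for td in row.find_all("td")] for row in rows]
print(cells)
```

Because Beautiful Soup builds the tree itself in this setup, tag names stay plain and no namespace option is involved.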

一花一树开 2024-09-26 01:10:15

Try:

root.find('.//{http://www.w3.org/1999/xhtml}tr')

You have to specify the namespace rather than the namespace prefix (html:tr). For more information, see the lxml docs, particularly the section:
