html5lib/lxml examples for BeautifulSoup users?
I'm trying to wean myself from BeautifulSoup, which I love but seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't seem to figure out how to use the "find" and "findall" operators.
By looking at the docs for html5lib, I came up with this for a test program:
import io  # Python 3 replacement for the old cStringIO module
f = io.StringIO()
f.write("""
<html>
<body>
<table>
<tr>
<td>one</td>
<td>1</td>
</tr>
<tr>
<td>two</td>
<td>2</td>
</tr>
</table>
</body>
</html>
""")
f.seek(0)
import html5lib
from html5lib import treebuilders
from lxml import etree # why?
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
etree_document = parser.parse(f)
root = etree_document.getroot()
root.find(".//tr")
But this returns None. I noticed that if I do an etree.tostring(root) I get all my data back, but all my tags are prefaced by html (e.g. <html:table>). But root.find(".//html:tr") throws a KeyError.
Can someone put me back on the right track?
You can turn off namespaces with this command:
etree_document = html5lib.parse(t, treebuilder="lxml", namespaceHTMLElements=False)
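As a rough illustration of that one-liner applied to the table from the question, here is a minimal sketch. It assumes html5lib and lxml are both installed; the HTML constant and the variable names are made up for the example.

```python
import html5lib

# Sample markup modelled on the table in the question (illustrative only).
HTML = """
<html><body><table>
<tr><td>one</td><td>1</td></tr>
<tr><td>two</td><td>2</td></tr>
</table></body></html>
"""

# With namespaceHTMLElements=False the tree uses plain tag names ("tr")
# instead of names qualified with the XHTML namespace.
document = html5lib.parse(HTML, treebuilder="lxml", namespaceHTMLElements=False)
root = document.getroot()

# Un-namespaced paths now match. html5lib inserts a <tbody> like a browser
# would, so the descendant search ".//tr" is used rather than a fixed path.
rows = root.findall(".//tr")
cells = [[td.text for td in row.findall("td")] for row in rows]
print(cells)
```

Because the parser follows the HTML5 algorithm, the resulting tree can contain elements (such as tbody) that were not in the source, so descendant searches tend to be more robust than absolute paths.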
In general, use lxml.html for HTML. Then you don't need to worry about generating your own parser or about namespaces. FYI, lxml.html also allows you to use CSS selectors, which I find to be an easier syntax.
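A minimal sketch of the lxml.html route, assuming only lxml is installed. The HTML constant is invented for the example, and the lookups use findall() and xpath() rather than CSS selectors so that no extra package is needed.

```python
import lxml.html

# Sample markup modelled on the table in the question (illustrative only).
HTML = """
<html><body><table>
<tr><td>one</td><td>1</td></tr>
<tr><td>two</td><td>2</td></tr>
</table></body></html>
"""

# lxml.html builds a tree with plain, un-namespaced tag names.
root = lxml.html.fromstring(HTML)

# ElementPath lookup: every <tr> anywhere below the root.
rows = root.findall(".//tr")

# XPath lookup: the text of the first cell in each row.
first_cells = root.xpath("//tr/td[1]/text()")
print(len(rows), first_cells)
```

The CSS selector form would be root.cssselect("tr"); in current lxml releases that method relies on the separately installed cssselect package.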
It appears that using the "lxml" html5lib TreeBuilder causes html5lib to build the tree in the XHTML namespace, which makes sense, as lxml is an XML library, and XHTML is how one represents HTML as XML. You can use lxml's qname syntax ({namespace-URI}tag) with the find() method, or you can use lxml's full XPath functions with a prefix-to-namespace mapping. The lxml documentation has more information on how it uses XML namespaces.
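The two approaches might look like the sketch below. To keep it self-contained it does not call html5lib; instead it parses a small XHTML string directly with lxml, which stands in for the namespaced tree html5lib would build. The sample markup and variable names are assumptions made for the example.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for the tree html5lib builds: every element is in the XHTML namespace.
root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    "<tr><td>one</td><td>1</td></tr>"
    "<tr><td>two</td><td>2</td></tr>"
    "</table></body></html>"
)

# A bare tag name does not match namespaced elements.
plain = root.find(".//tr")

# Qualified-name syntax: "{namespace-URI}localname".
first_row = root.find(".//{%s}tr" % XHTML_NS)

# Full XPath: map a prefix of your choice to the namespace URI.
all_rows = root.xpath(".//html:tr", namespaces={"html": XHTML_NS})

print(plain, first_row.tag, len(all_rows))
```

The prefix in the XPath call is arbitrary; only the URI it maps to has to match the namespace used in the tree.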
I realize that this is an old question, but I came here in a quest for information I didn't find in any other one place. I was trying to scrape something with BeautifulSoup but it was choking on some chunky html. The default html parser is apparently less loose than some others that are available. One often preferred parser is lxml, which I believe produces the same parsing as expected for browsers. BeautifulSoup allows you to specify lxml as the source parser, but using it requires a little bit of work.
First, you need html5lib AND you must also install lxml. While html5lib is prepared to use lxml (and some other libraries), the two do not come packaged together. [for Windows users, even though I don't like fussing with Win dependencies to the extent that I usually get libraries by making a copy in the same directory as my project, I strongly recommend using pip for this; pretty painless; I think you need administrator access.]
Then you need to write something like this:
Then enjoy your beautiful soup!
Note the namespaceHTMLElements=False option on the parser. This is important because lxml is intended for XML as opposed to just HTML. Because of that, it will label all the tags it provides as belonging to the HTML namespace. The tags will carry an html: prefix when serialized (like the <html:table> in the question), and BeautifulSoup will not work well.
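To make the effect of that option concrete, here is a small sketch that parses the same markup twice, once with and once without namespaced elements. It covers only the html5lib and lxml side, not BeautifulSoup itself; parse_root and MARKUP are hypothetical names introduced for the example, and both libraries are assumed to be installed.

```python
import html5lib
from html5lib import treebuilders


def parse_root(markup, namespaced):
    """Parse markup into an lxml tree and return its root element (illustrative helper)."""
    builder = treebuilders.getTreeBuilder("lxml")
    parser = html5lib.HTMLParser(tree=builder, namespaceHTMLElements=namespaced)
    return parser.parse(markup).getroot()


MARKUP = "<ul><li>one</li><li>two</li></ul>"

with_ns = parse_root(MARKUP, True)      # default behaviour: XHTML namespace
without_ns = parse_root(MARKUP, False)  # plain tag names

print(with_ns.tag)
print(without_ns.tag)

# Plain paths only match when the elements are not namespaced.
missing = with_ns.find(".//li")
items = [li.text for li in without_ns.findall(".//li")]
print(missing, items)
```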
You have to specify the namespace itself, in the form {namespace-URI}tr, rather than the namespace prefix (html:tr). For more information, see the lxml docs.
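A related sketch, using the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages (lxml's find() and findall() follow the same ElementPath conventions). It shows the namespace-URI form next to a prefix that is explicitly mapped through the namespaces argument; the sample markup is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A small namespaced document standing in for html5lib's output.
root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    "<tr><td>one</td><td>1</td></tr>"
    "<tr><td>two</td><td>2</td></tr>"
    "</table></body></html>"
)

# Namespace URI written directly into the path.
by_uri = root.findall(".//{%s}td" % XHTML_NS)

# A prefix works too, but only when it is mapped to the URI explicitly.
by_prefix = root.findall(".//html:td", namespaces={"html": XHTML_NS})

print([td.text for td in by_uri])
print([td.text for td in by_prefix])
```

A prefix on its own carries no meaning to the path engine, which is why the bare html:tr lookup in the question fails.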