Why doesn't xpath work when processing an XHTML document with lxml (in Python)?

Posted on 2024-07-09 05:03:56

I am testing against the following test document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>hi there</title>
    </head>
    <body>
        <img class="foo" src="bar.png"/>
    </body>
</html>

If I parse the document using lxml.html, I can get the IMG with an xpath just fine:

>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]

However, if I parse the document as XML and try to get the IMG tag, I get an empty result:

>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]

I can navigate to the element directly:

>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>

But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:

>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]

But that xpath is, again, obviously not useful for parsing arbitrary documents.

Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces but the only namespace defined is the default and I don't know what else I might need to consider in regards to namespaces.

So, what am I missing?

Comments (3)

顾北清歌寒 2024-07-16 05:03:56

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.

Try this:

>>> tree.getroot().xpath(
...     "//xhtml:img", 
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
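
For reference, here is a minimal self-contained sketch that reproduces both results against the test document from the question. The names `XHTML_NS`, `unprefixed` and `prefixed` are only illustrative, and the document is passed as bytes on the assumption that you are on Python 3, where lxml rejects str input that carries an encoding declaration.

```python
from io import BytesIO

from lxml import etree

# Illustrative constant: the namespace URI declared on the <html> element.
XHTML_NS = "http://www.w3.org/1999/xhtml"

doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>hi there</title>
    </head>
    <body>
        <img class="foo" src="bar.png"/>
    </body>
</html>
"""

tree = etree.parse(BytesIO(doc))
root = tree.getroot()

# An unprefixed name in XPath means "img in no namespace", so nothing matches.
unprefixed = root.xpath("//img")

# Binding a prefix to the XHTML namespace makes the same query match.
prefixed = root.xpath("//xhtml:img", namespaces={"xhtml": XHTML_NS})

print(unprefixed)
print([el.get("src") for el in prefixed])
```

The prefix itself is arbitrary; only the URI it is bound to has to match the one in the document.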

作妖 2024-07-16 05:03:56

XPath considers all unprefixed names to be in "no namespace".

In particular the spec says:

"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "

See these two detailed explanations of the problem and its solution: here (http://www.topxml.com/people/bosley/defaultns.asp) and here. The solution is to associate a prefix with the namespace (using the API that is being used) and to use that prefix for any otherwise unprefixed name in the XPath expression.
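
With lxml as the API, a rough sketch of that idea could look like the following. The shortened document and the names `XHTML_NS`, `find_images`, `by_prefix`, `by_local_name` and `by_clark` are assumptions made for illustration. Besides binding a prefix, it shows two other ways that also avoid relying on the default namespace: matching on `local-name()` and using Clark notation (`{uri}tag`) with `findall`.

```python
from lxml import etree

# Illustrative constant: the namespace URI used by the XHTML document.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# A shortened stand-in for the test document in the question.
doc = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><img class="foo" src="bar.png"/></body>'
    b"</html>"
)
root = etree.fromstring(doc)

# 1. Bind a prefix once in a compiled XPath and reuse it on any document.
find_images = etree.XPath("//x:img", namespaces={"x": XHTML_NS})
by_prefix = find_images(root)

# 2. Ignore namespaces entirely by comparing only the local part of the name.
by_local_name = root.xpath("//*[local-name() = 'img']")

# 3. Use Clark notation with findall, which takes ElementPath, not full XPath.
by_clark = root.findall(".//{%s}img" % XHTML_NS)

for result in (by_prefix, by_local_name, by_clark):
    print([el.get("src") for el in result])
```

The `local-name()` form matches an `img` element in any namespace, so it is the loosest of the three; the prefixed form is the one the specification text above describes.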

Hope this helped.

Cheers,

Dimitre Novatchev

初与友歌 2024-07-16 05:03:56

If you are going to use tags from a single namespace only, as appears to be the case above, you may be better off using lxml.objectify.

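In your case it would be something like

from lxml import objectify
# `url` is a placeholder for the document location; objectify.fromstring is also available
root = objectify.parse(url).getroot()  # parse() returns a tree; getroot() gives the <html> element

You can then access the nodes as attributes:

body = root.body  # child lookups default to the parent's namespace
for img in body.img:  # assuming all images are direct children of the body tag
    print(img.get("src"))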

While it might not be of great help in html, it can be highly useful in well structured xml.

For more info, check out http://lxml.de/objectify.html
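
As a small sketch of how this might look on the test document from the question (trimmed slightly, and passed as bytes on the assumption of Python 3), the following uses `objectify.fromstring`. The variable names `title` and `image_sources` are just illustrative.

```python
from lxml import objectify

doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head><title>hi there</title></head>
    <body>
        <img class="foo" src="bar.png"/>
    </body>
</html>
"""

# fromstring returns the root element directly, here the <html> element.
root = objectify.fromstring(doc)

# Attribute access looks up children in the parent's namespace by default,
# so no prefix or namespace mapping is needed.
title = root.head.title.text

# Iterating over a child yields all sibling elements with the same tag.
image_sources = [img.get("src") for img in root.body.img]

print(title)
print(image_sources)
```

Note that this only reaches `img` elements that are direct children of `body`; for images nested deeper, a namespace-aware XPath query is probably the simpler tool.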
