How to preserve namespace information when parsing HTML with lxml?

Posted on 2024-11-19 02:59:46

>>> from lxml.etree import HTML, tostring
>>> tostring(HTML('<fb:like>'))
'<html><body><like/></body></html>'

Note how the tag turns from <fb:like> to simply <like>.

This makes processing pages that incorporate XFBML with lxml much harder. (The same thing happens to <g:plusone></g:plusone>.)

Any help is appreciated.

仅一夜美梦 2024-11-26 02:59:46

One way to fix this issue is to patch libxml2.

Referring to the source code of libxml2 2.9.2 (https://git.gnome.org/browse/libxml2/tree/?id=v2.9.2): in SAX2.c (https://git.gnome.org/browse/libxml2/tree/SAX2.c?id=v2.9.2), the internal SAX parser used to create the DOM tree, attributes with xmlns are not parsed specially in HTML mode (line 1699); they are parsed like any other attribute (line 1740). Consequently, it makes sense to adjust line 1622, which splits the name into a prefix and a local part. Change:

name = xmlSplitQName(ctxt, fullname, &prefix);

into

if (!ctxt->html) {
    name = xmlSplitQName(ctxt, fullname, &prefix);
} else {
    name = xmlStrdup(fullname);
    prefix = NULL;
}

Then libxml2 will consider tags such as <o:p> to be for elements with name o:p, that is, the colon is included in the element name with no special meaning. This is the correct interpretation in HTML. For example, the HTML5 specification says:

In the HTML syntax, namespace prefixes and namespace declarations do not have the same effect as in XML. For instance, the colon has no special meaning in HTML element names.
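
To make the effect of the patched branch concrete, here is a toy model in Python. It is only a sketch: split_qname is a simplified, hypothetical stand-in for xmlSplitQName (the real function handles more edge cases), and element_name and its patched flag are illustrative names, not part of libxml2 or lxml.

```python
def split_qname(fullname):
    """Simplified stand-in for xmlSplitQName: split at the first colon.

    Returns (prefix, local_name). If there is no usable prefix or
    local part, the whole name is returned with prefix None.
    """
    prefix, sep, local = fullname.partition(":")
    if not sep or not prefix or not local:
        return None, fullname
    return prefix, local


def element_name(fullname, html_mode, patched=True):
    """Mirror the branch from the patch above (illustrative only).

    With the patch, HTML mode keeps the full name and sets no prefix;
    otherwise the name is split into prefix and local part.
    """
    if html_mode and patched:
        return None, fullname
    return split_qname(fullname)


if __name__ == "__main__":
    # Unpatched HTML mode: the prefix is split off, leaving "like".
    print(element_name("fb:like", html_mode=True, patched=False))
    # Patched HTML mode: the colon is just part of the element name.
    print(element_name("fb:like", html_mode=True))
    # XML mode is unchanged by the patch.
    print(element_name("fb:like", html_mode=False))
```

In the unpatched case only the local part survives as the element name, which matches the <like/> output shown in the question.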

Hopefully this change will be approved for a future version of libxml2. There is an open bug report (https://bugzilla.gnome.org/show_bug.cgi?id=654146).
