Python + lxml: how to find a tag's namespace?
I am processing some HTML files with python + lxml. Some of them have been edited with MS Word, and we have <p> tags written as <o:p>&nbsp;</o:p> for instance. IE and Firefox do not interpret these MS tags as real <p> tags, and do not display line breaks before and after the <o:p> tags, and that is how the original editors have formatted the files, e.g. no spaces around the nbsp's.

lxml, on the other hand, is tidy, and after processing the HTML files we see that all the <o:p> tags have been changed to proper <p> tags. Unfortunately, after this tidying up both browsers now display line breaks around all the nbsp's, which breaks the original formatting.

So, my idea was to browse through all those <o:p> tags and either remove them or add their .text attribute to the parent .text attribute, i.e. remove the <o:p> tag markers.
from lxml import etree
import lxml.html
from StringIO import StringIO

s = '<p>somepara</p> <o:p>msoffice_para</o:p>'
parser = lxml.html.HTMLParser()
html = lxml.html.parse(StringIO(s), parser)
for t in html.xpath("//p"):
    print "tag: " + t.tag + ", text: '" + t.text + "'"
The result is:

tag: p, text: 'somepara'
tag: p, text: 'msoffice_para'

So, lxml removes the namespace name from the tag marker. Is there a way to know which <p> tag is from which namespace, so I only remove the ones that were <o:p>?

Thanks.
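The "add their .text attribute to the parent .text attribute" step can be sketched independently of the namespace question. Below is a minimal sketch using the stdlib xml.etree.ElementTree, whose element API (.text, .tail, .remove) mirrors lxml's here; note that lxml also ships a ready-made helper, lxml.etree.strip_tags, for exactly this operation. The strip_child name is my own, and it assumes the element being removed has no element children of its own:

```python
import xml.etree.ElementTree as ET

def strip_child(parent, child):
    # Fold the child's .text and .tail back into the surrounding
    # text flow, then delete the child element itself.
    # Assumes `child` has no element children of its own.
    pieces = (child.text or '') + (child.tail or '')
    siblings = list(parent)
    i = siblings.index(child)
    if i == 0:
        parent.text = (parent.text or '') + pieces
    else:
        prev = siblings[i - 1]
        prev.tail = (prev.tail or '') + pieces
    parent.remove(child)

root = ET.fromstring('<p>before <span>inner</span> after</p>')
strip_child(root, root.find('span'))
print(ET.tostring(root, encoding='unicode'))  # <p>before inner after</p>
```

With lxml the one-liner etree.strip_tags(tree, 'span') does the same text-merging removal.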
From the HTML specs: "The HTML syntax does not support namespace declarations".
So I think lxml.html.HTMLParser removes/ignores the namespace. However, BeautifulSoup parses HTML differently, so I thought it might be worth a shot. If you also have BeautifulSoup installed, you can use the BeautifulSoup parser with lxml through the lxml.html.soupparser module.
BeautifulSoup does not remove the namespace, but neither does it recognize the namespace as such. Instead, it is just part of the name of the tag.
That is to say, querying for the prefixed tag directly with XPath (e.g. //o:p) does not work, since lxml reads the prefix as an undefined namespace. But the workaround/hack of matching on the literal tag name, e.g. //*[name()='o:p'], yields the <o:p> elements.
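BeautifulSoup may not be installed everywhere, but the behavior described above, where the prefix simply becomes part of the tag name, can be sketched with the stdlib html.parser, which likewise passes o:p through as a literal tag name (Python 3 here; that BeautifulSoup's output matches this exactly for your version is an assumption worth checking):

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect every start-tag name the parser reports, verbatim."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

c = TagCollector()
c.feed('<p>somepara</p> <o:p>msoffice_para</o:p>')
print(c.tags)                          # ['p', 'o:p']
print([t for t in c.tags if ':' in t])  # ['o:p'] -- the ones to strip
```

Filtering on a ':' in the tag name is then enough to single out the MS Office tags.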
If the html is actually well-formed, you could use the etree.XMLParser instead. Otherwise, try unutbu's answer.
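To illustrate the well-formed route: when the o: prefix is actually declared, both lxml.etree and the stdlib xml.etree.ElementTree expand tags to Clark notation ({uri}localname), so the two kinds of p become distinguishable. A sketch with the stdlib parser, using urn:schemas-microsoft-com:office:office, the URI Word normally binds to the o: prefix (the wrapping <div> and its xmlns:o declaration are added here to make the fragment well-formed):

```python
import xml.etree.ElementTree as ET

NS = 'urn:schemas-microsoft-com:office:office'
s = ('<div xmlns:o="%s">'
     '<p>somepara</p> <o:p>msoffice_para</o:p>'
     '</div>' % NS)

root = ET.fromstring(s)
for t in root.iter():
    print(t.tag)
# div
# p
# {urn:schemas-microsoft-com:office:office}p

# Select only the namespaced ones via a prefix mapping:
mso = root.findall('o:p', namespaces={'o': NS})
print([e.text for e in mso])  # ['msoffice_para']
```

lxml.etree accepts the same namespaces= mapping in xpath()/findall(), so the selected elements can then be fed straight into the tag-stripping step.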