Does lxml parse HTML depending on context?

Posted 2024-12-01 04:59:16

I'm using lxml to parse HTML:

>>> from lxml.html import fromstring, tostring

It parses trailing whitespace correctly in some cases:

>>> html = """<div>some <i>text</i> </div>"""
>>> html == tostring(fromstring(html))
True

But it seems to break when encountering unknown tags (such as the blah tag below).

>>> html = """<div>some <blah>text</blah> </div>"""
>>> html == tostring(fromstring(html))
False

How can I fix it to include trailing whitespace for all tags?

盛装女皇 2024-12-08 04:59:16

This appears to be caused by libxml2 itself (I've removed some error messages from the output below):

>>> import libxml2
>>> print(libxml2.htmlParseDoc("""<div>some <blah>text</blah> </div>""", "UTF-8"))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>some <blah>text</blah></div></body></html>


>>> print(libxml2.htmlParseDoc("""<div>some <i>text</i> </div>""", "UTF-8"))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>some <i>text</i> </div></body></html>

I'm still looking for a workaround. libxml2's XML parser doesn't exhibit this behavior, but I expect it would cope much worse with broken HTML.
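
The remark about the XML parser can be checked with a short sketch. This is not from the original answer: it assumes Python 3 and lxml's `etree.XMLParser`, and it only works here because the snippet happens to be well-formed XML. `recover=True` is added as an assumption to make the parser more tolerant, but real-world HTML (unclosed tags, bare `&`, void elements such as `<br>`) may still parse badly.

```python
from lxml import etree

html = "<div>some <blah>text</blah> </div>"

# recover=True asks the XML parser to keep going on malformed input
# instead of raising; it is still far less forgiving than the HTML parser.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(html, parser)

# The XML parser keeps whitespace-only text by default, so the space
# after </blah> ends up in the .tail of the <blah> element.
serialized = etree.tostring(root, encoding="unicode")
print(serialized == html)
print(repr(root[0].tail))
```

The trade-off is the one noted above: whitespace is preserved, but the input has to be close to well-formed XML.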

风为裳 2024-12-08 04:59:16

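You need to set a flag on the parser itself to remove whitespace. I've done this when parsing XML like so:

from lxml import etree

# remove_blank_text=True discards whitespace-only text nodes between tags
parser = etree.XMLParser(remove_blank_text=True)

# `file` is the path of the XML document to parse
data = etree.parse(file, parser)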
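
To make the effect of that flag concrete, here is a small sketch comparing the default XML parser with one created with `remove_blank_text=True`. The sample document is made up for illustration and parsed from a string rather than a file.

```python
from lxml import etree

xml = "<root>  <a/>   <b>  </b>     </root>"

# Default parser: whitespace-only text between tags is kept.
default_root = etree.fromstring(xml)

# With the flag: whitespace-only text between tags is discarded.
stripped_root = etree.fromstring(xml, etree.XMLParser(remove_blank_text=True))

print(etree.tostring(default_root, encoding="unicode"))
# <root>  <a/>   <b>  </b>     </root>
print(etree.tostring(stripped_root, encoding="unicode"))
# <root><a/><b>  </b></root>
```

Note that the flag removes whitespace instead of preserving it, and that the spaces inside `<b>` survive because they are the element's own content, not text between tags.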

About the author

萌梦深
