Python - HTML parsing with Tidy

Posted on 2024-09-28 02:40:56

This code takes a bit of bad HTML, uses the Tidy library to clean it up, and then passes the result to an HtmlLib.Reader().

import tidy
options = dict(output_xhtml=1, 
                add_xml_decl=1, 
                indent=1, 
                tidy_mark=0)

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

It seems I'm not passing the right type to fromString, judging by this traceback:

Traceback (most recent call last):
  File "getComicEmbed.py", line 33, in <module>
    doc = reader.fromString(tidy.parseString("<Html>Bad Html.</b>", **options))
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 67, in fromString
    stream = reader.StrStream(str)
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\__init__.py", line 24, in StrStream
    return cStringIO.StringIO(st)
TypeError: expected read buffer, _Document found

What should I do differently? Thanks!

Comments (2)

安穩 2024-10-05 02:40:56

tidy's parseString function returns a _Document instance, which implements __str__ but not the buffer interface. HtmlLib.Reader().fromString therefore cannot create a StringIO object from it.

The fix is fairly simple. Change:

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

to

doc = reader.fromString(str(tidy.parseString("<Html>Bad Html.", **options)))
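
The same failure can be reproduced without tidy or PyXML installed. The sketch below is only an illustration under stated assumptions: FakeDocument is a hypothetical stand-in for tidy's _Document, to_stream is a hypothetical stand-in for the reader's StrStream helper, and Python 3's io.StringIO takes the place of cStringIO. It shows that an object defining only __str__ is rejected, while an explicit str() call works.

```python
import io


class FakeDocument:
    """Hypothetical stand-in for tidy's _Document: it defines __str__ only."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        return self._markup


def to_stream(value):
    """Hypothetical stand-in for StrStream: wrap the input in a StringIO."""
    return io.StringIO(value)


doc = FakeDocument("<html><body>Bad Html.</body></html>")

# Passing the object itself fails: StringIO does not call str() on our behalf.
try:
    to_stream(doc)
    passed_directly = True
except TypeError:
    passed_directly = False

# Converting explicitly first gives StringIO the plain string it expects.
stream = to_stream(str(doc))
recovered = stream.read()
```

The explicit str() call is the whole fix: the conversion has to happen at the call site because the reader never performs it.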

策马西风 2024-10-05 02:40:56

I haven't used the Python tidy module and am not sure where to find it, but it looks like you need to call something like toString on the result of tidy.parseString to convert your parsed document back into XHTML.

For a different approach, consider lxml.html, which is decent at parsing broken markup and gives you a great ElementTree API for working with the result. It can also pretty-print XML and HTML, which makes it something of a superset of tidy, though perhaps without quite the same ability to navigate incoherent markup.

Also: lxml wraps C libraries (libxml2 and libxslt), much as the Python tidy module(s) wrap a C library, so it's much faster than some of the pure-Python modules for working with XML.
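
A minimal sketch of the lxml.html route, assuming the third-party lxml package is installed. It reuses the broken snippet from the question. The variable names are illustrative, and the exact repaired markup may vary with the underlying libxml2 version, so no output is shown.

```python
import lxml.html

broken = "<Html>Bad Html."

# document_fromstring returns a full document rooted at <html>,
# repairing unclosed tags along the way.
root = lxml.html.document_fromstring(broken)

# Serialize the repaired tree back to markup; pretty_print adds indentation,
# and encoding="unicode" returns a str rather than bytes.
cleaned = lxml.html.tostring(root, pretty_print=True, encoding="unicode")
print(cleaned)
```

Because root is already an ElementTree-style element, it can be queried directly (for example with root.text_content()) without a separate DOM reader.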
