Python - Using Tidy for HTML parsing

Posted 09-28 02:40

This code takes a bit of bad HTML, uses the Tidy library to clean it up, and then passes the result to an HtmlLib.Reader().

import tidy
options = dict(output_xhtml=1, 
                add_xml_decl=1, 
                indent=1, 
                tidy_mark=0)

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

It seems I'm not passing the right type to fromString, judging by this traceback:

Traceback (most recent call last):
  File "getComicEmbed.py", line 33, in <module>
    doc = reader.fromString(tidy.parseString("<Html>Bad Html.</b>", **options))
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 67, in fromString
    stream = reader.StrStream(str)
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\__init__.py", line 24, in StrStream
    return cStringIO.StringIO(st)
TypeError: expected read buffer, _Document found

What should I do differently? Thanks!

安穩 2024-10-05 02:40:56

tidy's parseString function returns a _Document instance which implements __str__ but not a buffer interface. Therefore HtmlLib.Reader().fromString cannot create a StringIO object out of it.

The fix should be fairly simple. Change:

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

to:

doc = reader.fromString(str(tidy.parseString("<Html>Bad Html.", **options)))
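The same failure mode can be reproduced without tidy or PyXML installed. The sketch below is only an illustration under stated assumptions: FakeDocument is a hypothetical stand-in for tidy's _Document (it implements __str__ and nothing else), and io.StringIO from current Python stands in for the cStringIO.StringIO call shown in the traceback. It is not the library's actual code.

```python
import io


class FakeDocument:
    """Hypothetical stand-in for tidy's _Document: printable, but not a string."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        return self._markup


doc = FakeDocument("<html><body>Bad Html.</body></html>")

# Passing the object itself fails: StringIO needs real string data,
# and defining __str__ does not make an object a string.
try:
    io.StringIO(doc)
    direct_ok = True
except TypeError:
    direct_ok = False

# Converting with str() first hands StringIO an actual string.
stream = io.StringIO(str(doc))
recovered = stream.read()
```

Wrapping the parseString result in str() plays the same role in the original code: the reader then receives plain text rather than the document object.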

策马西风 2024-10-05 02:40:56

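I haven't used the Python tidy module, and I'm not sure how to even find it, but it looks like you need to call something like toString on the result of tidy.parseString to convert the parsed document back into XHTML.

For a different approach, you could consider lxml.html, which is good at parsing broken markup and gives you a nice ElementTree API for working with the result. It can also pretty-print *ML, which makes it something of a superset of tidy, though perhaps without the same ability to work through incoherent markup.

Also: lxml is written in C (really, like the Python tidy module, it just wraps a C library), so it is much faster than some of the other Python modules for working with XML.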
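As a rough sketch of the lxml.html route, assuming the third-party lxml package is installed, the snippet below parses the same broken input that appears in the traceback and serializes it back out. The variable names are illustrative only; the repaired tree is whatever lxml's underlying parser decides on, so the printed markup may not match Tidy's output.

```python
import lxml.html

# The broken markup from the question's traceback: oddly cased tag,
# nothing closed properly, and a stray </b>.
broken = "<Html>Bad Html.</b>"

# fromstring() recovers a usable element tree from the bad input.
root = lxml.html.fromstring(broken)

# Serialize the repaired tree; encoding="unicode" returns str, not bytes.
cleaned = lxml.html.tostring(root, pretty_print=True, encoding="unicode")
print(cleaned)

# The result supports the ElementTree-style API, e.g. iterating over elements.
tags = [element.tag for element in root.iter()]
```

This skips the Tidy-to-reader handoff entirely, which may or may not suit code that depends on the DOM objects HtmlLib produces.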

About the author: 断肠人
