Python - HTML parsing with Tidy
This code takes a bit of bad HTML, uses the Tidy library to clean it up, and then passes the result to an HtmlLib.Reader().
import tidy
options = dict(output_xhtml=1,
               add_xml_decl=1,
               indent=1,
               tidy_mark=0)
from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()
doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))
It seems I'm not passing the right type to fromString, judging by this traceback:
Traceback (most recent call last):
  File "getComicEmbed.py", line 33, in <module>
    doc = reader.fromString(tidy.parseString("<Html>Bad Html.</b>", **options))
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 67, in fromString
    stream = reader.StrStream(str)
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\__init__.py", line 24, in StrStream
    return cStringIO.StringIO(st)
TypeError: expected read buffer, _Document found
What should I do differently? Thanks!
Comments (2)
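tidy's parseString function returns a _Document instance, which implements __str__ but not the buffer interface. Therefore HtmlLib.Reader().fromString cannot create a StringIO object out of it. The fix should be fairly simple: convert the document to a string first, so that fromString receives str(tidy.parseString("<Html>Bad Html.", **options)) instead of the _Document itself.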
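The type mismatch described in this answer can be reproduced without tidy or PyXML installed. The sketch below is an illustration only, under stated assumptions: FakeDocument is a hypothetical stand-in for tidy's _Document (an object that defines __str__ but is not itself a string), and make_stream is a hypothetical helper that mimics what fromString does with its argument. It uses Python 3's io.StringIO rather than the cStringIO shown in the traceback, so the exact error message differs.

```python
import io


class FakeDocument:
    """Hypothetical stand-in for the object tidy.parseString returns."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        # Like tidy's _Document, the object can render itself as text.
        return self._markup


def make_stream(source):
    # Hypothetical helper: wraps its argument in an in-memory text stream,
    # roughly what a fromString-style reader does internally.
    return io.StringIO(source)


doc = FakeDocument("<html><body>Bad Html.</body></html>")

# Passing the object directly fails, because StringIO needs a real string.
try:
    make_stream(doc)
    failed = False
except TypeError:
    failed = True

# Converting with str() first gives the reader what it expects.
stream = make_stream(str(doc))
content = stream.read()
```

The same idea applies to the original code: calling str() on the tidy result before handing it to the reader avoids the TypeError.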
I haven't used the Python tidy module, and am not sure how to find it, but it looks like you need to call something like toString on the result of tidy.parseString to convert your parsed document back into XHTML.
For a different approach, you could consider using lxml.html, which is decent at parsing broken markup and provides you with a great ElementTree API for working with the result. It can also pretty-print *ML, which makes it sort of a superset of tidy, though perhaps not with quite the same ability to navigate incoherent markup.
Also: lxml wraps C libraries (libxml2 and libxslt), just as the Python tidy module(s) wrap a C library, so it's much faster than some of the other Python modules for working with XML.
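As a rough sketch of the lxml.html approach suggested in this answer, assuming lxml is installed (it is a third-party package, not part of the standard library): the snippet parses the broken markup from the question, serializes the repaired tree, and reads its text back. The variable names are illustrative, and the exact shape of the repaired markup depends on the underlying libxml2 version.

```python
import lxml.html

broken = "<Html>Bad Html.</b>"

# document_fromstring parses the input as a full document and returns the
# root <html> element, recovering from unclosed and stray tags on the way.
root = lxml.html.document_fromstring(broken)

# Serialize the repaired tree back to markup, indented for readability.
cleaned = lxml.html.tostring(root, pretty_print=True, encoding="unicode")
print(cleaned)

# The parsed tree can then be queried through the ElementTree-style API.
body_text = root.text_content().strip()
```

This skips the tidy-then-reparse round trip entirely, since the parsed tree is available directly from lxml.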