How do I parse HTML containing named ISO-8859-1 entities with Python?
To summarize: minidom appears not to like ISO-8859-1 named entities; what's an appropriate resolution?
Here's code which illustrates my situation:
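import xml.dom.minidom

sample = """
<html>
<body>
<h1>Un ejemplo</h1>
<p>Me llamo Juan Fulano y Hern&aacute;ndez.</p>
</body>
</html>
"""
# Replacing the named entity with the literal character parses fine.
sample2 = sample.replace("&aacute;", "á")
dom2 = xml.dom.minidom.parseString(sample2)
# The original, with the named entity intact, raises ExpatError.
dom = xml.dom.minidom.parseString(sample)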
Briefly: when the HTML includes '&aacute;' and similar, expressed as named entities, minidom complains
... xml.parsers.expat.ExpatError: undefined entity ...
How should I respond? Do I
- Replace named entities with corresponding literal constants?
- Use a parser other than minidom? Which?
- Somehow (with an encoding assignment?) convince minidom that these named entities are cool?
Convincing the author of the (X)HTML to eschew named entities is not feasible.
Comments (1)
xml.dom.minidom is an XML parser, not an HTML parser. Therefore, it doesn't know any HTML entities (only those which are common to both XML and HTML: &quot;, &amp;, &lt;, &gt; and &apos;). Try BeautifulSoup.
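If adding a third-party dependency is not an option, the first idea from the question (replacing named entities with literal characters before parsing) can be sketched with the standard library alone. This is only a minimal sketch, not the approach suggested above: the helper name replace_html_entities and the extra "&amp; familia" in the sample are made up for illustration, and it assumes Python 3 and input that is otherwise well-formed XML.

```python
import re
import xml.dom.minidom
from html.entities import name2codepoint

# The five entities XML already predefines; leave these alone so that
# markup-significant characters stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(text):
    """Replace HTML named entities (e.g. &aacute;) with literal characters.

    Hypothetical helper for illustration; not part of any library.
    """
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep XML entities and unknown names as-is
        return chr(name2codepoint[name])

    return _ENTITY_RE.sub(substitute, text)


sample = """
<html>
<body>
<h1>Un ejemplo</h1>
<p>Me llamo Juan Fulano y Hern&aacute;ndez &amp; familia.</p>
</body>
</html>
"""

# After the substitution only XML-predefined entities remain, so expat
# no longer reports an undefined entity.
dom = xml.dom.minidom.parseString(replace_html_entities(sample))
paragraph = dom.getElementsByTagName("p")[0]
text = "".join(node.data for node in paragraph.childNodes)
print(text)
```

The regular expression is applied to the whole string, so it would also rewrite entity-like text inside comments or CDATA sections. Real-world HTML that is not well-formed XML will still fail in minidom, which is where an HTML parser such as BeautifulSoup is the better fit.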