How can I convert extended ASCII to HTML entity names in Python?
I'm currently doing this to replace extended-ASCII characters with their HTML entity number equivalents:

    s.encode('ascii', 'xmlcharrefreplace')

What I would like to do is convert to the HTML entity name equivalent (i.e. &copy; instead of &#169;). The small program below shows what I'm trying to do, and it fails. Is there a way to do this, aside from doing a find/replace?

    #coding=latin-1
    def convertEntities(s):
        return s.encode('ascii', 'xmlcharrefreplace')

    ok = 'ascii: !@#$%^&*()<>'
    not_ok = u'extended-ascii: ©®°±¼'

    ok_expected = ok
    not_ok_expected = u'extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;'

    ok_2 = convertEntities(ok)
    not_ok_2 = convertEntities(not_ok)

    if ok_2 == ok_expected:
        print 'ascii worked'
    else:
        print 'ascii failed: "%s"' % ok_2

    if not_ok_2 == not_ok_expected:
        print 'extended-ascii worked'
    else:
        print 'extended-ascii failed: "%s"' % not_ok_2
Is htmlentitydefs what you want?
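For readers who have not used the module, here is a small sketch of what it provides. It assumes Python 3, where the Python 2 module htmlentitydefs is available as html.entities; the lookups shown are only illustrative.

```python
# Python 3: the Python 2 module `htmlentitydefs` lives at `html.entities`.
from html.entities import codepoint2name, name2codepoint

# codepoint2name maps a Unicode code point to its HTML entity name.
print(codepoint2name[0xA9])    # copy
print(codepoint2name[0xBC])    # frac14

# name2codepoint is the reverse mapping, from entity name to code point.
print(name2codepoint['copy'])  # 169
```

Wrapping a name as '&' + name + ';' gives the entity text the question asks for.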
Edit: Others have mentioned the htmlentitydefs module, which I never knew about. It would work with my code this way:
And that should work.
I'm not sure how directly, but I think the htmlentitydefs module will be of use. An example can be found here.

Update: This is the solution I'm going with, with a small fix to check that entitydefs contains a mapping for the character we have.
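A minimal sketch of that kind of per-character lookup, with the check for a missing mapping, might look like the following. This is not the asker's actual code: it assumes Python 3 (html.entities rather than htmlentitydefs), uses codepoint2name instead of entitydefs, and the name convert_entities and the numeric fallback are choices made for illustration.

```python
from html.entities import codepoint2name

def convert_entities(text):
    """Replace non-ASCII characters with named entities where one exists."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII (including & < >) passes through untouched,
            # matching the question's ok_expected.
            out.append(ch)
        elif cp in codepoint2name:
            # A named entity exists for this character, e.g. &copy;
            out.append('&%s;' % codepoint2name[cp])
        else:
            # No named entity: fall back to a numeric reference.
            out.append('&#%d;' % cp)
    return ''.join(out)

print(convert_entities('extended-ascii: \xa9\xae\xb0\xb1\xbc'))
# extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;
```

The membership check matters because most code points have no named entity, and an unchecked dictionary lookup would raise KeyError on them.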
Are you sure that you don't want the conversion to be reversible? Your ok_expected string indicates you don't want existing & characters escaped, so the conversion will be one-way. The code below assumes that & should be escaped, but just remove the cgi.escape call if you really don't want that.

Anyway, I'd combine your original approach with a regular expression substitution: do the encoding as before and then just fix up the numeric entities. That way you don't end up mapping every single character through your getEntity function.
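The encode-then-fix-up approach described above could be sketched roughly as follows. This is an illustration, not the answerer's original code: it assumes Python 3, so html.escape(..., quote=False) stands in for cgi.escape and html.entities for htmlentitydefs, and the names convert_entities, _name_for and escape_markup are hypothetical.

```python
import html
import re
from html.entities import codepoint2name

# Matches the decimal references produced by 'xmlcharrefreplace', e.g. &#169;
NUMERIC_ENTITY = re.compile(r'&#(\d+);')

def _name_for(match):
    """Swap a numeric reference for a named one when a name exists."""
    name = codepoint2name.get(int(match.group(1)))
    return '&%s;' % name if name else match.group(0)

def convert_entities(text, escape_markup=True):
    if escape_markup:
        # Escape & < > first so the result stays reversible.
        text = html.escape(text, quote=False)
    # Same encoding step as in the question, decoded back to str.
    encoded = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')
    # Only the numeric references are touched by the substitution.
    return NUMERIC_ENTITY.sub(_name_for, encoded)

print(convert_entities('\xa9 & \xbc'))                       # &copy; &amp; &frac14;
print(convert_entities('\xa9 & \xbc', escape_markup=False))  # &copy; & &frac14;
```

Passing escape_markup=False corresponds to removing the escape call, which gives the one-way behaviour that the question's ok_expected implies.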