How do I convert extended ASCII to HTML entity names in Python?

Posted on 2024-09-10 16:33:43

I'm currently doing this to replace extended-ascii characters with their HTML-entity-number equivalents:

s.encode('ascii', 'xmlcharrefreplace')

What I would like to do is convert to the HTML-entity-name equivalent (i.e. &copy; instead of &#169;). The small program below shows what I'm trying to do and how it fails. Is there a way to do this, aside from doing a find/replace?

#coding=latin-1

def convertEntities(s):
    return s.encode('ascii', 'xmlcharrefreplace')

ok = 'ascii: !@#$%^&*()<>'
not_ok = u'extended-ascii: ©®°±¼'

ok_expected = ok
not_ok_expected = u'extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;'

ok_2 = convertEntities(ok)
not_ok_2 = convertEntities(not_ok)

if ok_2 == ok_expected:
    print 'ascii worked'
else:
    print 'ascii failed: "%s"' % ok_2

if not_ok_2 == not_ok_expected:
    print 'extended-ascii worked'
else:
    print 'extended-ascii failed: "%s"' % not_ok_2
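
For reference, here is a minimal Python 3 sketch of what the current approach produces, which is why the second check fails. The variable names are illustrative only, and the `.decode("ascii")` step is an assumption added because `encode` returns bytes in Python 3.

```python
# Python 3 sketch; variable names are illustrative only.
text = "extended-ascii: ©®°±¼"

# Non-ASCII characters are replaced with numeric character references,
# not named entities.
numeric = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(numeric)  # extended-ascii: &#169;&#174;&#176;&#177;&#188;
```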


Comments (5)

清风挽心 2024-09-17 16:33:43

Is htmlentitydefs what you want?

import htmlentitydefs
htmlentitydefs.codepoint2name.get(ord(c),c)
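
A small sketch of how that lookup might be wrapped, assuming Python 3, where the same table lives in `html.entities`. The helper name `entity_for` is made up for illustration. The lookup returns only the bare name, so the `&` and `;` have to be added by hand, and the table also covers ASCII markup characters such as `<`.

```python
from html.entities import codepoint2name  # Python 3 location of the table

def entity_for(c):
    """Return '&name;' if the character has a named entity, else the character."""
    name = codepoint2name.get(ord(c))
    return "&%s;" % name if name else c

print(entity_for("©"))  # &copy;
print(entity_for("a"))  # a
print(entity_for("<"))  # &lt;
```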

神爱温柔 2024-09-17 16:33:43

Edit

Others have mentioned the htmlentitydefs module, which I never knew about. It would work with my code this way:

from htmlentitydefs import entitydefs as symbols

# Replaces each named entity (e.g. &copy;) with the character it stands for.
for tag, val in symbols.iteritems():
    mystr = mystr.replace("&{0};".format(tag), val)

And that should work.
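
The loop above turns entity names back into characters, which is the reverse of what the question asks for. If that direction is what you need, a Python 3 sketch (an assumption, since the thread targets Python 2) can lean on the standard `html.unescape` function instead of a replace loop:

```python
import html

# Decode named (and numeric) entities back into characters.
decoded = html.unescape("&copy; 2024 &amp; &frac14;")
print(decoded)  # © 2024 & ¼
```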

鸠魁 2024-09-17 16:33:43

I'm not sure how to do it directly, but I think the htmlentitydefs module will be of use. An example can be found here.

雪若未夕 2024-09-17 16:33:43

Update: This is the solution I'm going with, with a small fix to check that htmlentitydefs.codepoint2name contains a mapping for the character we have.

import htmlentitydefs

def convertEntities(s):
    return ''.join([getEntity(c) for c in s])

def getEntity(c):
    ord_c = ord(c)
    # Only replace non-ASCII characters that have a named entity.
    if ord_c > 127 and ord_c in htmlentitydefs.codepoint2name:
        return "&%s;" % htmlentitydefs.codepoint2name[ord_c]
    return c
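
A possible variant of the same idea, assuming Python 3: build a translation table once and let `str.translate` do the per-character work. `ENTITY_TABLE` and `convert_entities` are illustrative names, not part of any library.

```python
from html.entities import codepoint2name

# Map every non-ASCII code point that has a named entity to '&name;'.
# ASCII characters (including &, < and >) are deliberately left out.
ENTITY_TABLE = {
    cp: "&%s;" % name for cp, name in codepoint2name.items() if cp > 127
}

def convert_entities(s):
    """Replace non-ASCII characters with named entities where one exists."""
    return s.translate(ENTITY_TABLE)

print(convert_entities("extended-ascii: ©®°±¼"))
# extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;
```

Characters with no named entity in the table pass through unchanged, just as in the per-character version.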

丶情人眼里出诗心の 2024-09-17 16:33:43

Are you sure that you don't want the conversion to be reversible? Your ok_expected string indicates that you don't want existing & characters escaped, so the conversion would be one-way. The code below assumes that & should be escaped; just remove the cgi.escape call if you really don't want that.

Anyway, I'd combine your original approach with a regular expression substitution: do the encoding as before and then just fix up the numeric entities. That way you don't end up mapping every single character through your getEntity function.

#coding=latin-1
import cgi
import re
import htmlentitydefs

def replace_entity(match):
    c = int(match.group(1))
    name = htmlentitydefs.codepoint2name.get(c, None)
    if name:
        return "&%s;" % name
    return match.group(0)

def convertEntities(s):
    s = cgi.escape(s) # Remove if you want ok_expected to pass!
    s = s.encode('ascii', 'xmlcharrefreplace')
    s = re.sub("&#([0-9]+);", replace_entity, s)
    return s

ok = 'ascii: !@#$%^&*()<>'
not_ok = u'extended-ascii: ©®°±¼'

ok_expected = ok
not_ok_expected = u'extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;'

ok_2 = convertEntities(ok)
not_ok_2 = convertEntities(not_ok)

if ok_2 == ok_expected:
    print 'ascii worked'
else:
    print 'ascii failed: "%s"' % ok_2

if not_ok_2 == not_ok_expected:
    print 'extended-ascii worked'
else:
    print 'extended-ascii failed: "%s"' % not_ok_2
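
For anyone on Python 3, here is a rough port of the same encode-then-fix-up approach. It is a sketch under a few assumptions: `html.escape` stands in for `cgi.escape` (which was removed in Python 3.8), `html.entities` replaces `htmlentitydefs`, and the `escape_markup` flag is a hypothetical parameter added only to make the escaping optional.

```python
import html
import re
from html.entities import codepoint2name

def replace_entity(match):
    """Swap a numeric reference for its named form when one is defined."""
    name = codepoint2name.get(int(match.group(1)))
    return "&%s;" % name if name else match.group(0)

def convert_entities(s, escape_markup=True):
    if escape_markup:
        # Escapes &, < and > only; quotes are left alone.
        s = html.escape(s, quote=False)
    # encode() returns bytes in Python 3, so decode back to str.
    s = s.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return re.sub(r"&#([0-9]+);", replace_entity, s)

print(convert_entities("extended-ascii: ©®°±¼"))
# extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;
```

Characters without a named entity keep their numeric reference, so the output is always pure ASCII.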