Dealing with wacky encodings in Python

Posted 2024-09-04 17:10:37

I have a Python script that pulls in data from many sources (databases, files, etc.). Supposedly, all the strings are unicode, but what I end up getting is any variation on the following theme (as returned by repr()):

u'D\\xc3\\xa9cor'
u'D\xc3\xa9cor'
'D\\xc3\\xa9cor'
'D\xc3\xa9cor'

Is there a reliable way to take any of the four strings above and return the proper unicode string?

u'D\xe9cor' # --> Décor

The only way I can think of right now uses eval(), replace(), and a deep, burning shame that will never wash away.

手长情犹 2024-09-11 17:10:37

That's just UTF-8 data. Use .decode to convert it into unicode.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'

You can perform an additional string-escape decode for the 'D\\xc3\\xa9cor' case.

>>> 'D\xc3\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> u'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'

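To handle the second case (u'D\xc3\xa9cor') as well, detect whether the input is unicode and, if so, encode it to a str first. ISO-8859-1 maps each code point below 256 to the byte with the same value, so the original UTF-8 bytes are recovered unchanged.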

>>> def conv(s):
...   if isinstance(s, unicode):
...     s = s.encode('iso-8859-1')
...   return s.decode('string-escape').decode('utf-8')
... 
>>> map(conv, [u'D\\xc3\\xa9cor', u'D\xc3\xa9cor', 'D\\xc3\\xa9cor', 'D\xc3\xa9cor'])
[u'D\xe9cor', u'D\xe9cor', u'D\xe9cor', u'D\xe9cor']
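
The conv function above depends on Python 2 features (the unicode type and the string-escape codec), neither of which exists in Python 3. As a rough sketch only, the same idea could be expressed in Python 3 along these lines. The name conv_py3 is hypothetical, and the sketch assumes every input is one of the four mangled variants from the question, given either as str or as bytes.

```python
def conv_py3(s):
    """Recover the intended text from UTF-8 data that was mis-decoded,
    backslash-escaped, or both (assumption: no code point above U+00FF)."""
    if isinstance(s, str):
        # The str holds byte values as code points; latin-1 maps them back 1:1.
        s = s.encode('latin-1')
    # unicode_escape resolves literal escapes such as \\xc3 and reads every
    # other byte as latin-1, so re-encoding as latin-1 yields the raw bytes.
    raw = s.decode('unicode_escape').encode('latin-1')
    return raw.decode('utf-8')


variants = ['D\\xc3\\xa9cor', 'D\xc3\xa9cor', b'D\\xc3\\xa9cor', b'D\xc3\xa9cor']
print([conv_py3(v) for v in variants])  # ['Décor', 'Décor', 'Décor', 'Décor']
```

Note that a string which is already correct, such as 'Décor', would raise UnicodeDecodeError here, so a function like this should only be applied to data known to be mangled.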

帝王念 2024-09-11 17:10:37

Write adapters that know which transformations should be applied to their sources.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
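
One way to picture the adapter idea is a small registry that maps each source to the single transformation its data needs, instead of guessing per string. The following is a minimal Python 3 sketch under assumed names: the source labels 'files' and 'legacy_db', the ADAPTERS table, and the helper functions are all hypothetical and do not come from the answer.

```python
def decode_utf8_bytes(raw):
    """Source delivers raw UTF-8 bytes."""
    return raw.decode('utf-8')


def decode_escaped_utf8(text):
    """Source delivers text with literal backslash escapes of UTF-8 bytes."""
    raw = text.encode('latin-1').decode('unicode_escape').encode('latin-1')
    return raw.decode('utf-8')


# Hypothetical registry: each source name maps to the one adapter it needs.
ADAPTERS = {
    'files': decode_utf8_bytes,
    'legacy_db': decode_escaped_utf8,
}


def read_value(source, value):
    """Apply the adapter registered for this source to a single value."""
    return ADAPTERS[source](value)


print(read_value('files', b'D\xc3\xa9cor'))        # Décor
print(read_value('legacy_db', 'D\\xc3\\xa9cor'))   # Décor
```

Keeping the decision at the source level means correctly encoded data never passes through a repair step it does not need.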

人事已非 2024-09-11 17:10:37

Here's the solution I came up with before I saw KennyTM's proper, more concise solution:

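def ensure_unicode(string):
    # Python 2 only: relies on the unicode type and the 'string-escape' codec.
    try:
        # Undo up to two levels of literal backslash escaping.
        string = string.decode('string-escape').decode('string-escape')
    except UnicodeEncodeError:
        # A unicode object with non-ASCII code points cannot be implicitly
        # ASCII-encoded; map each code point below 256 to the same byte value.
        string = string.encode('raw_unicode_escape')

    # At this point string holds UTF-8 encoded bytes; decode them to unicode.
    return unicode(string, 'utf-8')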
