Dealing with wacky encodings in Python
I have a Python script that pulls in data from many sources (databases, files, etc.). Supposedly, all the strings are unicode, but what I end up getting is any variation on the following theme (as returned by repr()):
u'D\\xc3\\xa9cor'
u'D\xc3\xa9cor'
'D\\xc3\\xa9cor'
'D\xc3\xa9cor'
Is there a reliable way to take any of the four strings above and return the proper unicode string?
u'D\xe9cor' # --> Décor
The only way I can think of right now uses eval(), replace(), and a deep, burning shame that will never wash away.
Answers (3)
That's just UTF-8 data. Use .decode to convert it into unicode.
You can perform an additional string-escape decode for the 'D\\xc3\\xa9cor' case.
To handle the 2nd case as well, you need to detect if the input is unicode, and convert it into a str first.
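The steps in this answer could be combined roughly as follows. This is only a sketch, and it assumes Python 3, where str plays the role of Python 2's unicode and bytes the role of Python 2's str. The function name fix_text is hypothetical, and the unicode_escape codec stands in for Python 2's string-escape, which no longer exists in Python 3.

```python
def fix_text(value):
    """Return correctly decoded text for any of the four variants in the question.

    Assumes the underlying data is always UTF-8 and that a literal "\\x"
    in the input is an escape sequence, never legitimate content.
    """
    if isinstance(value, str):
        # Text whose code points are really raw byte values: map each
        # code point back to one byte. Raises UnicodeEncodeError if a
        # code point is above 255.
        raw = value.encode("latin-1")
    else:
        raw = bytes(value)

    if b"\\x" in raw:
        # Literal backslash escapes such as \xc3: interpret them, then
        # go back to bytes so the result can be decoded as UTF-8.
        raw = raw.decode("unicode_escape").encode("latin-1")

    return raw.decode("utf-8")
```

With this sketch, fix_text('D\\xc3\\xa9cor'), fix_text('D\xc3\xa9cor'), fix_text(b'D\\xc3\\xa9cor') and fix_text(b'D\xc3\xa9cor') all return 'Décor'. Text that is already correct but contains non-ASCII characters would be rejected or mangled, so this is only suitable where every input is known to be one of the broken forms.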
Write adapters that know which transformations should be applied to their sources.
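One way to read this suggestion, as a rough Python 3 sketch: keep one small function per kind of damage and a table recording which function each source needs. The source names ("legacy_db", "export_file", "api") and all function names here are made up for illustration.

```python
def from_utf8_bytes(raw):
    # Source hands over real UTF-8 bytes.
    return raw.decode("utf-8")


def from_escaped_bytes(raw):
    # Source hands over bytes containing literal \xNN escape sequences.
    return raw.decode("unicode_escape").encode("latin-1").decode("utf-8")


def from_mojibake_text(text):
    # Source hands over text in which every character is really one UTF-8 byte.
    return text.encode("latin-1").decode("utf-8")


# Hypothetical mapping from source name to the transformation it needs.
ADAPTERS = {
    "api": from_utf8_bytes,
    "export_file": from_escaped_bytes,
    "legacy_db": from_mojibake_text,
}


def read_text(source, value):
    """Apply the adapter registered for the given source."""
    return ADAPTERS[source](value)
```

The point of this layout is that nothing has to guess at the encoding: each source is inspected once, and the fix it needs is written down next to its name.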
Here's the solution I came to before I saw KennyTM's proper, more concise solution: