A Python encoding question: a small example that is quite confusing

Posted on 2022-09-01 18:48:55

# -*- coding:utf-8 -*-
'''
Created on 2015-10-08
'''

def main():
    s = u"你好"
    d = {'id':001, 'text':s}
    s1 = "你好"
    d1 = {'id':002, 'text':s1}
    print d
    print s
    print "------------"
    print d1
    print s1

if __name__ == "__main__": main()

Output:

{'text': u'\u4f60\u597d', 'id': 1}
你好
------------
{'text': '\xe4\xbd\xa0\xe5\xa5\xbd', 'id': 2}
你好

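Why do the strings print as normal Chinese characters when printed directly, while inside the dict they show up as \uxxxx or \x.. escapes?
Could someone who knows explain this?
PS: I ran into the same dict behavior when storing Chinese text with sqlite3 and when scraping Chinese data with scrapy. It is a real headache.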


倾听心声的旋律 2022-09-08 18:48:55

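OK. To explain this clearly I will assume you know what an encoding is; if that is not entirely clear, see the article 人机交互之字符编码 first. Below I walk through your code.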

# -*- coding:utf-8 -*-

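This tells the Python interpreter that the script is encoded as UTF-8, so the interpreter decodes the script file as UTF-8 (you do, of course, have to make sure the file really is saved as UTF-8).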

String vs Unicode String

s1 = "你好"
s2 = u"你好"

s1 is of type str, while s2 is of type unicode, as shown here:

>>> s1 = "你好"
>>> type(s1)
<type 'str'>
>>> s2 = u"你好"
>>> type(s2)
<type 'unicode'>

Both types are Python Sequence Types.

A str string internally holds a plain sequence of bytes, that is, what a piece of text looks like after it has been encoded:
```
>>> str_1 = "你好"
>>> str_1
'\xe4\xbd\xa0\xe5\xa5\xbd'
```
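My console uses UTF-8 by default, so what the interpreter receives for str_1 is the byte string e4bda0e5a5bd, which is 你好 encoded as UTF-8. In your script, 你好 is likewise stored in s1 as its UTF-8 encoded bytes.
A unicode string internally holds a sequence of code points. Each code point is in the range 0 to 0x10ffff and identifies exactly one character in the Unicode character set. In other words, for a unicode string the interpreter sees the sequence of code points of its characters.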
```
>>> unicode_str2 = u"你好"
>>> unicode_str2
u'\u4f60\u597d'
>>> u"你"
u'\u4f60'
>>> u"好"
u'\u597d'
```
Here 你 corresponds to code point 4f60 in the Unicode character set, and 好 corresponds to 597d.
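To make the difference between the two views concrete, here is a small sketch that takes the same text and looks at it once as code points and once as UTF-8 bytes. The variable names are only for illustration, and it is written to run on Python 3 (it should behave the same on Python 2.7, where `u"..."` is a unicode object and the encoded result is a str).

```python
# -*- coding: utf-8 -*-

text = u"你好"                      # text: a sequence of code points
utf8_bytes = text.encode("utf-8")  # bytes: the encoded form of that text

# One entry per character: the Unicode code point of each one.
code_points = [hex(ord(ch)) for ch in text]

# One entry per byte of the UTF-8 encoding (three bytes per character here).
byte_values = [hex(b) for b in bytearray(utf8_bytes)]

# Decoding with the same encoding gives the original text back.
round_trip = utf8_bytes.decode("utf-8")

print(code_points)
print(byte_values)
```

The two characters become six bytes, which is why the byte form in the question has six \x escapes while the unicode form has two \u escapes.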

A closer look at print

Before looking at how print works in Python, you first need to know two special methods of objects, __repr__ and __str__:

object.__repr__(self): Called by the repr() built-in function and by string conversions (reverse quotes) to compute the “official” string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value. If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__() but not __str__(), then __repr__() is also used when an “informal” string representation of instances of that class is required.

object.__str__(self): Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object. This differs from __repr__() in that it does not have to be a valid Python expression: a more convenient or concise representation may be used instead. The return value must be a string object.

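When we print an object in Python, print effectively calls str() on it: the type's own __str__ is used if it defines one, and otherwise the lookup falls back to __repr__.

[Figure: how print chooses between __str__ and __repr__]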

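The str and unicode types define their own __str__, which returns a string that is easy for us to read. dict and list do not define __str__, so the __repr__ that describes the object precisely is called instead. The repr of a container in turn uses the repr of every element it holds, which is why the strings inside the dict appear as escape sequences.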
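The same rule can be observed without any Chinese text at all. The sketch below uses two made-up classes, `OnlyRepr` and `Both` (both hypothetical, they are not part of the original question), and only calls `str()`, so it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

class OnlyRepr(object):
    """Hypothetical class that defines __repr__ but not __str__."""

    def __repr__(self):
        return "OnlyRepr()"


class Both(object):
    """Hypothetical class that defines both methods."""

    def __repr__(self):
        return "Both()"

    def __str__(self):
        return "a readable Both"


# str() falls back to __repr__ when the class defines no __str__.
fallback = str(OnlyRepr())

# str() uses __str__ when it exists ...
direct = str(Both())

# ... but a container builds its text from repr() of each element.
in_list = str([Both()])
in_dict = str({"item": Both()})

print(fallback)
print(direct)
print(in_list)
print(in_dict)
```

So even a class with a friendly __str__ shows its __repr__ as soon as it sits inside a list or dict, which is the same thing that happens to the strings in the question.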

>>> str.__str__
<slot wrapper '__str__' of 'str' objects>
>>> unicode.__str__
<slot wrapper '__str__' of 'unicode' objects>
>>> dict.__str__
<slot wrapper '__str__' of 'object' objects>
>>> dict.__repr__
<slot wrapper '__repr__' of 'dict' objects>

dict.__str__ resolves to the __str__ of 'object', which shows that dict does not define its own __str__. dict does define __repr__, so print d outputs the same thing as print repr(d).

>>> d1 = {'id':002, 'text':"你好"}
>>> print d1
{'text': '\xe4\xbd\xa0\xe5\xa5\xbd', 'id': 2}
>>> print repr(d1)
{'text': '\xe4\xbd\xa0\xe5\xa5\xbd', 'id': 2}
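If the goal is only to look at a dict with its text readable, one option is to serialize it yourself instead of relying on repr. The sketch below uses the standard `json` module as one possible way to do that; note that this is my suggestion rather than something from the question, and that the result is JSON text (double quotes), not a Python literal.

```python
# -*- coding: utf-8 -*-
import json

d = {"id": 1, "text": u"你好"}

# Default behavior: non-ASCII characters are escaped, much like repr() did.
escaped = json.dumps(d, sort_keys=True)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(d, ensure_ascii=False, sort_keys=True)

print(escaped)
```

Printing the individual values, for example d['text'], works too, because then print is handed the string itself rather than the dict.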

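When scraping Chinese data with scrapy: first find out which encoding the data you received is in, then encode it accordingly.
When storing Chinese text with sqlite3: decode the data you want to save according to what the sqlite3 database expects. In both cases the escapes you saw inside the dict are only how the value is displayed; the data itself is intact.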
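As a rough illustration of the sqlite3 case, the sketch below decodes UTF-8 bytes into text, stores the text in an in-memory database, and reads it back. The table name `items` and the assumption that the incoming bytes are UTF-8 are mine; real scraped data may use a different encoding.

```python
# -*- coding: utf-8 -*-
import sqlite3

# Assumption: the bytes we received are UTF-8 encoded.
raw = b"\xe4\xbd\xa0\xe5\xa5\xbd"
text = raw.decode("utf-8")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, text TEXT)")

# Pass text (not encoded bytes) as the parameter for a TEXT column.
conn.execute("INSERT INTO items VALUES (?, ?)", (1, text))

row = conn.execute("SELECT text FROM items WHERE id = 1").fetchone()
stored = row[0]
conn.close()
```

The value that comes back is the same text that went in, even though printing the whole row tuple may still show escapes, for the repr reason described above.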

More

The documentation describes repr() as follows:

repr() is meant to generate representations which can be read by the interpreter (or will force a SyntaxError if there is no equivalent syntax).

What do u"\uxxxx" and "\x" mean? Skip this part if you are not interested.

Escape Sequence	Meaning
\uxxxx	Character with 16-bit hex value xxxx (Unicode only)
\xhh	Character with hex value hh

Here are some examples:

>>> chr(0x41)
'A'
>>> "\x41"
'A'
>>> "\x01" # a non printable character
'\x01'
>>> "\x41abc"
'Aabc'
>>> print u"\u5b66"  # per the Unicode tables, the code point of the character 学 is U+5b66
学
>>> print u"\u5b66abc"
学abc


不及他 2022-09-08 18:48:55

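What you are getting back is just the raw encoded form, so there is no need to worry.
In Python, printing an object that is not a str implicitly calls the object's __str__ method (in effect, it converts the object to a string).
The __str__ method of dict (and of list, tuple and many other Python built-in objects) applies this kind of escaping to strings (so that the output consists only of ASCII characters).

If you print d['text'] or print d1['text'], you will see the result you expect.

Here is an example: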

class A:
     def __str__(self):
         return "hello"

a = A()
print a

The program above prints hello.

Edit:

I went and checked Python's C source. For the dict case it is indeed repr rather than str, so my answer above is wrong.

See: https://github.com/python/cpython/blob/2.7/Objects/dictobject.c#L1023


何其悲哀 2022-09-08 18:48:55

# calling repr()
>>> a=u"你好"
>>> b="你好"
>>> print repr(a)
u'\u4f60\u597d'
>>> print repr(b)
'\xc4\xe3\xba\xc3'
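The bytes shown here differ from the '\xe4\xbd\xa0\xe5\xa5\xbd' in the question, most likely because this console uses GBK rather than UTF-8. A small sketch, assuming those two encodings, shows that the same text gives different bytes under each:

```python
# -*- coding: utf-8 -*-

text = u"你好"

utf8_bytes = text.encode("utf-8")  # three bytes per character
gbk_bytes = text.encode("gbk")     # two bytes per character

# Each byte string decodes back to the same text with its own encoding.
same_text = utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk")

print(repr(utf8_bytes))
print(repr(gbk_bytes))
```

This is why a str in Python 2 only makes sense together with the encoding that produced it, while the unicode form is the same everywhere.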
