UTF-8 and upper()
I want to transform UTF-8 strings using built-in functions such as upper() and capitalize().
For example:
>>> mystring = "işğüı"
>>> print mystring.upper()
Işğüı # should be İŞĞÜI instead.
How can I fix this?
Do not perform operations on encoded strings; decode to unicode first.
It's actually best, as a general strategy, to always keep your text as Unicode once it's in memory: decode it at the moment it's input, and encode it exactly at the moment you need to output it, if there are specific encoding requirements at input and/or output times.
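To make the "decode at input, encode at output" idea concrete, here is a minimal sketch. It assumes the text arrives as a UTF-8 file; the temporary file and the variable names are made up for the illustration. io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Put some UTF-8 encoded bytes on disk to act as the "input".
with io.open(path, "wb") as f:
    f.write(u"işğüı".encode("utf-8"))

# Decode at the moment of input: with an encoding, io.open yields Unicode text.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# All processing happens on Unicode text in memory.
shouted = text.upper()

# Encode at the moment of output.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(shouted)

# Read the raw bytes back to confirm that the output is UTF-8 encoded.
with io.open(path, "rb") as f:
    raw = f.read()
```

The encoding is named only at the two boundaries; nothing in between has to know that the data was ever UTF-8.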
Even if you don't choose to adopt this general strategy (and you should!), the only sound way to perform the task you require is still to decode, process, encode again -- never to work on the encoded forms. I.e., decode the string from UTF-8, call upper() on the resulting unicode object, and encode the result back to UTF-8, assuming you're constrained to encoded strings at assignment and for output purposes. (The output constraint is unfortunately realistic, the assignment constraint isn't -- just do mystring = u"işğüı", making it unicode from the start, and save yourself at least the .decode call!-)
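The decode, process, encode round trip can be sketched roughly as follows. This is an illustration rather than the answer's original snippet: upper_utf8 is a hypothetical helper name, and the encoded input is built with .encode("utf-8") so that the same code runs on Python 3 (where the encoded form is bytes) as well as Python 2.

```python
# -*- coding: utf-8 -*-

def upper_utf8(encoded):
    """Hypothetical helper: uppercase a UTF-8 encoded byte string."""
    text = encoded.decode("utf-8")    # decode: bytes -> Unicode text
    shouted = text.upper()            # process: case mapping on Unicode text
    return shouted.encode("utf-8")    # encode: Unicode text -> bytes

# Stands in for a string that arrived already encoded as UTF-8.
mystring = u"işğüı".encode("utf-8")

result = upper_utf8(mystring)
print(repr(result))  # repr keeps the output ASCII-safe on any terminal
```

Note that upper() applies the default Unicode case mapping, so both i and ı come out as plain I. The dotted İ that the question expects follows Turkish-specific casing rules, which this sketch does not handle.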