How do I convert Unicode to uppercase for printing?
I have this:
>>> print 'example'
example
>>> print 'exámple'
exámple
>>> print 'exámple'.upper()
EXáMPLE
What do I need to do to print:
EXÁMPLE
(Where the 'a' keeps its acute accent, but in uppercase.)
I'm using Python 2.6.
Comments (5)
I think it's as simple as not converting to ASCII first.
In Python 2.x, just convert the string to unicode before calling upper(). Your code is in UTF-8 on this webpage, so decode it from UTF-8 first.
The call to decode takes the string from its current encoding to unicode. You can then convert it to some other encoding, like UTF-8, by using encode. If the characters were in, say, ISO-8859-2 (Czech, etc.), you would instead use s.decode('iso-8859-2').upper().
As in my case, if your terminal is not unicode/UTF-8 compliant, the best you can hope for is either a hex representation of the characters (like mine) or a lossy conversion using s.decode('utf-8').upper().encode('ascii', 'replace'), which results in 'EX?MPLE'. If you can't make your terminal show unicode, write the output to a file in UTF-8 format and open that in your favourite editor.
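The decode, upper, encode steps described for Python 2.x can be sketched as below. This is only an illustration under a loud assumption: it is written for Python 3, where a bytes object stands in for the Python 2 byte string. The variable names are made up for the example.

```python
# Assumption: Python 3, with bytes standing in for a Python 2 str.
s = b'ex\xc3\xa1mple'            # UTF-8 bytes of 'exámple'

# Decode to text first, then upper() follows the Unicode case mapping.
text_upper = s.decode('utf-8').upper()

# The same letters stored as ISO-8859-2 need that codec instead.
latin2 = b'ex\xe1mple'           # ISO-8859-2 bytes of 'exámple'
latin2_upper = latin2.decode('iso-8859-2').upper()

# Lossy fallback for a terminal that can only show ASCII.
ascii_fallback = text_upper.encode('ascii', 'replace')

print(ascii(text_upper))         # ASCII-safe view of the text
print(ascii_fallback)
```

Printing through ascii() keeps the demo safe on terminals that cannot show the accented letter.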
first off, i only use python 3.1 these days; its central merit is to have disambiguated byte strings from unicode objects. this makes the vast majority of text manipulations much safer than used to be the case. weighing in the trillions of user questions regarding python 2.x encoding problems, the u'äbc' convention of python 2 was just a mistake; with explicit bytes and bytearray, life becomes so much easier.
secondly, if py3k is not your flavor, then try to go with from __future__ import unicode_literals, as this will mimic py3k's behavior on python 2.6 and 2.7. this would have avoided the (easily committed) blunder you made when saying print 'exámple'.upper(). essentially, this is the same as in py3k: print( 'exámple'.encode( 'utf-8' ).upper() ). compare three versions (for py3k): the first prints the upper-cased utf-8 bytes, the second decodes those upper-cased bytes back to text, and the third calls upper() on the text itself.
the first one is, basically, what you did when you used a bare string 'exámple', provided you set your default encoding to utf-8 (according to a BDFL pronouncement, setting the default encoding at run time is a bad idea, so in py2 you'll have to trick it by saying import sys; reload( sys ); sys.setdefaultencoding( 'utf-8' ); i present a better solution for py3k below). when you look at the output of these three lines, you can see that when upper() got applied to the first text, it acted on bytes, not on characters. python allows the upper() method on bytes, but it is only defined on the US-ASCII interpretation of bytes. since utf-8 encodes non-ASCII characters with byte values within 8 bits but outside of US-ASCII (128 up to 255, which US-ASCII does not use), those won't be affected by upper(), so when we decode back in the second line, we get that lower-case á. finally, the third line does it right, and yes, surprise, python seems to be aware that Á is the upper case letter corresponding to á.
i ran a quick test to see what characters python 3 does not convert between upper and lower case. perusing the list reveals very few incidences of latin, cyrillic, or greek letters; most of the output is non-european characters and punctuation. the only characters i could find that python got wrong are Ԥ/ԥ (\u0524, \u0525, 'cyrillic {capital|small} letter pe with descender'), so as long as you stay outside of the Latin Extended-X blocks (check out those, they might yield surprises), you might actually use that method. of course, i did not check the correctness of the mappings.
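The three-line comparison discussed above can be sketched for Python 3 as follows. The names line1 to line3 are illustrative, and the results are shown through ascii() so the demo does not depend on the terminal encoding.

```python
text = 'ex\u00e1mple'   # 'exámple'

# 1) upper() on the UTF-8 bytes: only the ASCII letters change.
line1 = text.encode('utf-8').upper()
# 2) decoding those bytes back shows the accented letter is still lowercase.
line2 = line1.decode('utf-8')
# 3) upper() on the text itself uses the Unicode case mapping.
line3 = text.upper()

print(line1)            # b'EX\xc3\xa1MPLE'
print(ascii(line2))     # 'EX\xe1MPLE'  (lower-case á survives)
print(ascii(line3))     # 'EX\xc1MPLE'  (upper-case Á)
```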
lastly, in my py3k application boot section i put a method that redefines the encoding sys.stdout sees, with numerical character references (NCRs) as fallback; this has the effect that printing to standard output will never raise a unicode encoding error. when i work on ubuntu, _sys.stdout.encoding is utf-8; when the same program runs on windows, it might be something quaint like cp850. the output might look strange, but the application runs without raising an exception on those dim-witted terminals.
one more piece of advice: when testing, always try to print repr( x ) or a similar thing that reveals the identity of x. all kinds of misunderstandings can crop up if you just print x in py2 and x is either an octet string or a unicode object. it is very puzzling and prone to cause a lot of head-scratching. as i said, try to move at least to py26 with that from __future__ import unicode_literals incantation.
and to close: Glyph Lefkowitz says it best in his article Encoding.
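One way to get the NCR fallback described above is sketched here, assuming Python 3 and the standard io.TextIOWrapper with errors='xmlcharrefreplace'. This is not necessarily the code the answer refers to. The helper name ncr_writer is hypothetical, and an in-memory buffer with an ASCII encoding stands in for a limited terminal.

```python
import io

def ncr_writer(binary_stream, encoding):
    """Wrap a binary stream so unencodable characters become &#NNN; references."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors='xmlcharrefreplace', newline='\n')

# Stand-in for a terminal that only understands ASCII (assumption for the demo).
fake_terminal = io.BytesIO()
out = ncr_writer(fake_terminal, 'ascii')

# With the default errors='strict' this print would raise UnicodeEncodeError.
print('EX\u00c1MPLE', file=out)
out.flush()

shown = fake_terminal.getvalue()
print(shown)   # b'EX&#193;MPLE\n'
```

For the real standard output, the same wrapper could be applied at start-up with something like sys.stdout = ncr_writer(sys.stdout.buffer, sys.stdout.encoding).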
update: just found python 3 correctly converts ſ LATIN SMALL LETTER LONG S to S when uppercasing. neat!
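A rough sketch of such a case-conversion test, assuming Python 3. The function name, the U+3000 limit, and the restriction to characters in a letter category are choices made for this example, not the original test. Results depend on the Unicode database bundled with the interpreter, so characters an older Python mishandled may convert correctly on a newer one.

```python
import unicodedata

def caseless_letters(limit=0x3000):
    """Letters below `limit` that upper() and lower() both leave unchanged."""
    found = []
    for codepoint in range(limit):
        ch = chr(codepoint)
        is_letter = unicodedata.category(ch).startswith('L')
        if is_letter and ch.upper() == ch and ch.lower() == ch:
            found.append(ch)
    return found

unchanged = caseless_letters()
print(len(unchanged), 'letters below U+3000 have no case conversion')
print('Unicode database version:', unicodedata.unidata_version)

# The long s from the update: upper-casing it gives a plain S.
long_s_upper = '\u017f'.upper()
print(long_s_upper)
```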
I think there's a bit of background we're missing here:
As long as you're using "unicode" strings instead of "native" strings, methods like upper() will operate with unicode in mind. FWIW, Python 3 uses unicode by default, making the distinction largely irrelevant.
Taking a string from unicode to str and then back to unicode is suboptimal in many ways, and many libraries will produce unicode output if you want it; so try to use only unicode objects for strings internally whenever you can.
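A small sketch of why text objects are the better internal representation, assuming Python 3, where str is the unicode type and bytes plays the role of the Python 2 native string. The variable names are illustrative.

```python
raw = b'ex\xc3\xa1mple'        # UTF-8 bytes, as read from a file or socket
text = raw.decode('utf-8')     # decode once, at the input boundary

# Operations on bytes count and transform bytes, not characters.
byte_length = len(raw)         # 8: the accented letter takes two bytes
# Operations on text are character-aware.
char_length = len(text)        # 7
shouted = text.upper()

# Encode once, at the output boundary.
payload = shouted.encode('utf-8')

print(byte_length, char_length)
print(payload)
```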
Try it: