How do I convert Unicode to uppercase for printing?

Posted 2024-07-17 03:20:36

I have this:

>>> print 'example'
example
>>> print 'exámple'
exámple
>>> print 'exámple'.upper()
EXáMPLE

What do I need to do to print:

EXÁMPLE

(Where the 'a' keeps its acute accent, but in uppercase.)

I'm using Python 2.6.

最丧也最甜 2024-07-24 03:20:36

I think it's as simple as not converting to ASCII first.

 >>> print u'exámple'.upper()
 EXÁMPLE

卷耳 2024-07-24 03:20:36

In Python 2.x, just convert the string to unicode before calling upper(). Using your code, which is UTF-8 encoded on this web page:

>>> s = 'exámple'
>>> s
'ex\xc3\xa1mple'  # my terminal is not utf8. c3a1 is the UTF-8 hex for á
>>> s.decode('utf-8').upper()
u'EX\xc1MPLE'  # U+00C1 is the Unicode code point for Á (the uppercase letter)

The call to decode takes it from its current format to unicode. You can then convert it to some other format, like utf-8, by using encode. If the character was in, say, iso-8859-2 (Czech, etc, in this case), you would instead use s.decode('iso-8859-2').upper().
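For readers on Python 3, the same decode, uppercase, encode round trip might look like the sketch below. It is an illustration rather than the answerer's code: the variable names are invented, and the ISO-8859-2 bytes are produced inside the script only so that there is something to decode.

```python
# Python 3 sketch: bytes have to be decoded to text before case conversion.
raw = 'exámple'.encode('utf-8')           # b'ex\xc3\xa1mple', e.g. read from a UTF-8 source

text = raw.decode('utf-8')                # bytes -> str (Unicode text)
upper_text = text.upper()                 # case mapping now works on characters
upper_bytes = upper_text.encode('utf-8')  # back to bytes for output

# The same text stored as ISO-8859-2 needs the matching codec name.
latin2 = 'exámple'.encode('iso-8859-2')
upper_from_latin2 = latin2.decode('iso-8859-2').upper()

print(ascii(upper_text), upper_bytes)
```

Decoding with the wrong codec name either raises an error or silently yields the wrong characters, so the codec passed to decode has to match how the bytes were actually produced.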

As in my case, if your terminal is not unicode/utf-8 compliant, the best you can hope for is either a hex representation of the characters (like mine) or to convert it lossily using s.decode('utf-8').upper().encode('ascii', 'replace'), which results in 'EX?MPLE'. If you can't make your terminal show unicode, write the output to a file in utf-8 format and open that in your favourite editor.
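Both fallbacks from the paragraph above can be sketched in a few lines of Python 3. The file name and temporary directory are hypothetical and chosen only for the example.

```python
import io
import os
import tempfile

text = 'exámple'.upper()

# Lossy fallback for an ASCII-only terminal: unencodable characters become '?'.
ascii_fallback = text.encode('ascii', 'replace')

# Lossless alternative: write UTF-8 to a file and open it in an editor.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')  # hypothetical location
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(text + '\n')

print(ascii_fallback)
```

io.open with an explicit encoding also exists in Python 2.6, which is why it is used here instead of the built-in open.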

拥醉 2024-07-24 03:20:36

first off, i only use python 3.1 these days; its central merit is to have disambiguated byte strings from unicode objects. this makes the vast majority of text manipulations much safer than used to be the case. given the countless user questions about python 2.x encoding problems, the u'äbc' convention of python 2.1 was just a mistake; with explicit bytes and bytearray, life becomes so much easier.

secondly, if py3k is not your flavor, then try to go with from __future__ import unicode_literals, as this will mimic py3k's behavior on python 2.6 and 2.7. it would have avoided the (easily committed) blunder you made when saying print 'exámple'.upper(). essentially, that is the same as saying this in py3k: print( 'exámple'.encode( 'utf-8' ).upper() ). compare these versions (for py3k):

print( 'exámple'.encode( 'utf-8' ).upper() )
print( 'exámple'.encode( 'utf-8' ).upper().decode( 'utf-8' ) )
print( 'exámple'.upper() )

the first one is, basically, what you did when you used a bare string 'exámple', provided your default encoding is set to utf-8 (according to a BDFL pronouncement, setting the default encoding at run time is a bad idea, so in py2 you'll have to trick it by saying import sys; reload( sys ); sys.setdefaultencoding( 'utf-8' ); i present a better solution for py3k below). when you look at the output of these three lines:

b'EX\xc3\xa1MPLE'
EXáMPLE
EXÁMPLE

you can see that when upper() got applied to the first text, it acted on bytes, not on characters. python allows the upper() method on bytes, but it is only defined on the US-ASCII interpretation of bytes. since utf-8 uses values within 8 bits but outside of US-ASCII (128 up to 255, which are not used by US-ASCII), those won't be affected by upper(), so when we decode back in the second line, we get that lower-case á. finally, the third line does it right, and yes, surprise, python seems to be aware that Á is the upper case letter corresponding to á. i ran a quick test to see what characters python 3 does not convert between upper and lower case:

# print every character below code point 3000 that has no case mapping at all
for cid in range( 3000 ):
  my_chr = chr( cid )
  if my_chr == my_chr.upper() and my_chr == my_chr.lower():
    print( ascii( my_chr ) )

perusing the list reveals very few instances of latin, cyrillic, or greek letters; most of the output is non-european characters and punctuation. the only characters i could find that python got wrong are Ԥ/ԥ (\u0524, \u0525, 'cyrillic {capital|small} letter pe with descender'), so as long as you stay outside of the Latin Extended-X blocks (check out those, they might yield surprises), you might actually use that method. of course, i did not check the correctness of the mappings.

lastly, here is what i put into my py3k application boot section: a method that redefines the error handling sys.stdout uses, with numerical character references (NCRs) as fallback; this has the effect that printing to standard output will never raise a unicode encoding error. when i work on ubuntu, _sys.stdout.encoding is utf-8; when the same program runs on windows, it might be something quaint like cp850. the output might look strange, but the application runs without raising an exception on those dim-witted terminals.

#===========================================================================================================
# MAKE STDOUT BEHAVE IN A FAILSAFE MANNER
#-----------------------------------------------------------------------------------------------------------
import io as _sys_io
import sys as _sys

def _harden_stdout():
  """Ensure that unprintable output to STDOUT does not cause encoding errors; use XML character references
  so any kind of output gets a chance to render in a decipherable way."""
  global _sys_TRM
  # re-wrap the underlying binary buffer, keeping the encoding but changing the error handler
  _sys.stdout       = _sys_TRM = _sys_io.TextIOWrapper(
    _sys.stdout.buffer,
    encoding        = _sys.stdout.encoding,
    errors          = 'xmlcharrefreplace',
    line_buffering  = True )
#...........................................................................................................
_harden_stdout()
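To see what the 'xmlcharrefreplace' error handler does to a character the target encoding cannot represent, here is a small Python 3 sketch. It uses ASCII as a stand-in for a limited terminal encoding, which is an assumption made for the example.

```python
text = 'EXÁMPLE'

# Á (U+00C1, decimal 193) has no ASCII representation, so the handler
# substitutes a numerical character reference instead of raising an error.
encoded = text.encode('ascii', errors='xmlcharrefreplace')

print(encoded)
```

On Python 3.7 and later, sys.stdout.reconfigure( errors = 'xmlcharrefreplace' ) may be a shorter way to get a similar effect, provided sys.stdout is still the standard text wrapper.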

one more piece of advice: when testing, always try to print repr( x ) or a similar thing that reveals the identity of x. all kinds of misunderstandings can crop up if you just print x in py2 and x is either an octet string or a unicode object. it is very puzzling and prone to cause a lot of head-scratching. as i said, try to move at least to py26 with that from __future__ import unicode_literals incantation.
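A short Python 3 sketch of that advice follows; the variable names are invented for the example. ascii() is used for the text value so that the output itself stays printable on any terminal.

```python
text = 'exámple'             # str: Unicode text
data = text.encode('utf-8')  # bytes: the UTF-8 encoding of that text

text_view = ascii(text)      # like repr(), but with non-ASCII characters escaped
data_view = repr(data)       # the b'...' prefix gives the type away

print(type(text).__name__, text_view)
print(type(data).__name__, data_view)
```

Printed plainly, both values could look identical on a UTF-8 terminal; the escaped forms make it obvious which one is text and which one is encoded bytes.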

and to close, quoting a quote: "Glyph Lefkowitz says it best in his article Encoding:

I believe that in the context of this
discussion, the term "string" is
meaningless. There is text, and there
is byte-oriented data (which may very
well represent text, but is not yet
converted to it). In Python types,
Text is unicode. Data is str. The idea
of "non-Unicode text" is just a
programming error waiting to happen."

update: just found python 3 correctly converts ſ LATIN SMALL LETTER LONG S to S when uppercasing. neat!
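The long s case also shows that case mappings are not always reversible or length-preserving. A small Python 3 sketch, with the sharp s added as a second, well-known example that is not part of the original answer:

```python
long_s = '\u017f'   # ſ LATIN SMALL LETTER LONG S
sharp_s = '\u00df'  # ß LATIN SMALL LETTER SHARP S

upper_long_s = long_s.upper()        # 'S'
round_trip = long_s.upper().lower()  # 's', not the original long s
upper_sharp_s = sharp_s.upper()      # 'SS': one character becomes two

print(ascii(upper_long_s), ascii(round_trip), ascii(upper_sharp_s))
```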

白龙吟 2024-07-24 03:20:36

I think there's a bit of background we're missing here:

>>> type('hello')
<type 'str'>

>>> type(u'hello')
<type 'unicode'>

As long as you're using "unicode" strings instead of "native" byte strings, methods like upper() will operate with Unicode in mind. FWIW, Python 3 uses unicode by default, making the distinction largely irrelevant.

Taking a string from unicode to str and then back to unicode is suboptimal in many ways, and many libraries will produce unicode output if you want it; so try to use only unicode objects for strings internally whenever you can.
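One way to keep text as Unicode internally is to decode at the input boundary and encode only at the output boundary. The Python 3 sketch below does this with io.TextIOWrapper; the function name upper_stream and the in-memory streams are hypothetical, chosen only to make the example self-contained.

```python
import io

def upper_stream(binary_in, binary_out, encoding='utf-8'):
    """Uppercase a byte stream, working on text everywhere in between."""
    # newline='' disables newline translation so bytes round-trip unchanged.
    reader = io.TextIOWrapper(binary_in, encoding=encoding, newline='')
    writer = io.TextIOWrapper(binary_out, encoding=encoding, newline='')
    for line in reader:
        writer.write(line.upper())  # line is str here, never bytes
    writer.flush()
    # Detach so discarding the wrappers does not close the caller's streams.
    reader.detach()
    writer.detach()

source = io.BytesIO('exámple\n'.encode('utf-8'))
target = io.BytesIO()
upper_stream(source, target)
print(target.getvalue())
```

The same function would work with real files opened in binary mode, since the decoding and encoding happen only inside the wrappers.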
