How do I resolve difficulties decoding and printing Greek characters with Python?

Posted on 2024-11-09 17:01:10

I am creating a simple game designed to prompt the user for the Greek translation of an English word. For example:

cow: # here, the gamer would answer with *η αγελάδα* in order to score one point.

I use a helper function to read and decode from a txt file. I do so using the following code in said function:

# The variable filename refers to my helper function's sole parameter, it takes the 
# above mentioned txt file as an argument.
words_text = codecs.open(filename, 'r', 'utf-8')

This helper function then reads each line. The lines resemble something like this:

# In stack data, when I debug, it reads as u"\η αγελάδα - cow\r\n".
u"\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1 - cow\r\n"

The first line of the file, however, comes back with an unwanted prefix, \ufeff:

# u"\ufeffη αγελάδα - cow\r\n"
u"\ufeff\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1 - cow\r\n"

Note: After reviewing Mark's answer, I found out that the prepended object (\ufeff) is a BOM signature (it is used to distinguish UTF-8 from other encodings).

It's a minor issue and I am not sure how to remove it in the tidiest of manners. Anyways, my helper function then creates and returns a new dictionary which looks something like this:

{u'\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1': 'cow'}

Then, in my main function, I use the following in order to store the user's input:

# This is the code for the prompt I noted at the beginning.
# The variable gr_en_dict is the dictionary noted right above.
for key in gr_en_dict:
    user_reply = raw_input('%s: ' % (gr_en_dict[key])).decode(sys.stdout.encoding)

I then compare the value of the user's input with the appropriate key in the dictionary:

# I imported unicodedata as ud.
if ud.normalize('NFC', user_reply) == ud.normalize('NFC', key):
    score += 1
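
To illustrate what the normalization step guards against, here is a small sketch (Python 3 syntax; the two sample strings are illustrative and do not come from the word file). The same accented letter can be stored either as one precomposed code point or as a base letter plus a combining accent, and only the normalized forms compare equal:

```python
import unicodedata as ud

composed = '\u03ac'          # GREEK SMALL LETTER ALPHA WITH TONOS
decomposed = '\u03b1\u0301'  # alpha followed by COMBINING ACUTE ACCENT

# The raw strings differ because their code point sequences differ.
raw_equal = composed == decomposed

# After NFC normalization both become the single precomposed code point.
nfc_equal = ud.normalize('NFC', composed) == ud.normalize('NFC', decomposed)

print(raw_equal, nfc_equal)
```

Whether this matters in practice depends on how the console and the text file each encode accented letters, which is not something the snippet can tell you.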

In a response to a question similar to mine, the user ΤΖΩΤΖΙΟΥ said to import the module unicodedata and to call the normalize method (which I did in the code above), but I suspect that might not be necessary. Unfortunately, this step of the program is of no concern just yet because I have a problem decoding the user's input. To demonstrate, when I print the canonical string representation of user_reply and that of the corresponding key in my dictionary [using the built-in repr()] I get the following result:

user's input (user_reply):

u'? \u03b1?\u03b5??\u03b4\u03b1'

If I print the user's input without the repr() function, it looks like this:

? α?ε??δα

key in my dictionary:

u'\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1'

If I print it without repr(), I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\u03b7' in position 0: character maps to <undefined>

Notice the question marks in the user's input and the error I get when I try to print the Greek word proper. This seems to be the crux of my problem.

So, what exactly do I need to do in order to decode the user's input and to display all Greek characters properly?

When using my native code page:

C:\>chcp
Active code page: 437

C:\>\python25\python
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print '? α?ε??δα'
? α?ε??δα
>>>

When using the Greek code page (strangely, the output appears correctly only when I copy it to the clipboard first and then paste it into a word-processing application. I would post an image of what it actually prints in the default console, but I lack the reputation to do so):

C:\>chcp 869
Active code page: 869

C:\>\python25\python
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp869'
>>> print ' η αγελάδα'
 η αγελάδα
>>> print 'η αγελάδα'
η αγελάδα
>>>

Update: I had to change the default console's font to Lucida Console. That resolved the discrepancy.

仅一夜美梦 2024-11-16 17:01:10

For part of your question, use:

words_text = codecs.open(filename, 'r', 'utf-8-sig')

and it will strip the byte order mark (\ufeff) when decoding.
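
As a minimal sketch of what `utf-8-sig` does (written for Python 3; the temporary file and the helper name `read_lines` are made up for illustration), the following writes a file that starts with a BOM and reads it back with both codecs:

```python
import codecs
import io
import os
import tempfile


def read_lines(path, encoding):
    """Return the lines of a text file decoded with the given encoding."""
    # newline='' keeps the original \r\n line endings untouched.
    with io.open(path, 'r', encoding=encoding, newline='') as handle:
        return handle.readlines()


# Build a sample file that starts with the UTF-8 byte order mark.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as handle:
    handle.write(codecs.BOM_UTF8 + 'η αγελάδα - cow\r\n'.encode('utf-8'))

with_bom = read_lines(sample_path, 'utf-8')         # BOM survives as '\ufeff'
without_bom = read_lines(sample_path, 'utf-8-sig')  # BOM dropped on decode
os.remove(sample_path)

# ascii() shows escapes, so the output does not depend on the console.
print(ascii(with_bom[0]))
print(ascii(without_bom[0]))
```

`utf-8-sig` also reads files that have no BOM at all, so it should be safe to use whether or not the editor added one.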

Technically, this:

user_reply = raw_input('%s: ' % (gr_en_dict[key])).decode(sys.stdout.encoding)

should be:

user_reply = raw_input('%s: ' % (gr_en_dict[key])).decode(sys.stdin.encoding)

but in practice they should be the same encoding.
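
A hedged sketch of that decoding step, written so it can be tried without a console (Python 3 syntax; `decode_console_bytes` is a hypothetical helper, and the UTF-8 fallback is an assumption for cases where `sys.stdin.encoding` is `None`, such as redirected input):

```python
import sys


def decode_console_bytes(raw, encoding=None):
    """Decode bytes typed at the console using the input stream's encoding."""
    if encoding is None:
        # sys.stdin.encoding can be None when input is redirected.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)


# Simulate the bytes a cp869 console would hand to the program.
typed = 'η αγελάδα'.encode('cp869')
reply = decode_console_bytes(typed, 'cp869')

# ascii() shows escapes, so the output does not depend on the console.
print(ascii(reply))
```

In Python 3, `input()` already returns decoded text, so an explicit step like this only applies to the byte strings returned by Python 2's `raw_input()`.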

I believe the problem is the encoding in your default console does not support all Greek characters. When I change to a Greek code page, things begin to work better. Note that I can paste the correct characters into the print statement below, but cp437 doesn't actually support all the characters, so when printed the unsupported characters are replaced with a question mark:

C:\>chcp
Active code page: 437

C:\>python
Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print 'η αγελάδα - cow'
? α?ε??δα - cow

If I switch to a Greek code page (869 or 1253), it works:

C:\>chcp 869
Active code page: 869

C:\>python
Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp869'
>>> print 'η αγελάδα - cow'
η αγελάδα - cow
>>>
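
The same behavior can be reproduced without switching the console, by encoding the word explicitly. This is only a sketch (Python 3 syntax; `errors='replace'` is used to mimic the question marks, which is an assumption about how the console substitutes characters it cannot represent):

```python
word = 'η αγελάδα'

# cp437 has only a few Greek letters (α, δ, ε among them); the rest are lost.
as_cp437 = word.encode('cp437', errors='replace').decode('cp437')

# The Greek code pages cover every letter in the word, so it round-trips.
as_cp869 = word.encode('cp869').decode('cp869')
as_cp1253 = word.encode('cp1253').decode('cp1253')

# ascii() shows escapes, so the output does not depend on the console.
print(ascii(as_cp437))
print(as_cp869 == word, as_cp1253 == word)
```

The cp437 result has question marks in the same positions as the `user_reply` shown in the question, which is consistent with the console encoding being the cause.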


茶底世界 2024-11-16 17:01:10

The standard Windows shell has issues with extended characters. I would suggest using something like Windows PowerShell.

For the '\ufeff' character, which is the byte order mark, you could perform the following check after reading in the file:

words_text = codecs.open(filename, 'r', 'utf-8')
words_text_lines = words_text.readlines()

if words_text_lines and words_text_lines[0][0]==unicode(codecs.BOM_UTF8, 'utf8'):
    words_text_lines[0] = words_text_lines[0][1:]

That way you're discarding it if it's there.
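
A rough Python 3 equivalent of that check, as a sketch (Python 3 has no `unicode()` built-in, so the BOM is written as the literal `'\ufeff'`; `strip_bom` is a hypothetical helper name and the second sample line is invented for illustration):

```python
def strip_bom(lines):
    """Return a copy of lines with a leading byte order mark removed."""
    if lines and lines[0].startswith('\ufeff'):
        return [lines[0][1:]] + lines[1:]
    return list(lines)


sample = ['\ufeffη αγελάδα - cow\r\n', 'ο σκύλος - dog\r\n']
cleaned = strip_bom(sample)

# ascii() shows escapes, so the output does not depend on the console.
print(ascii(cleaned[0]))
```

Returning a new list instead of modifying the argument is a design choice made here so the original lines stay available for debugging.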
