Python + PostgreSQL + strange ascii = UTF8 encoding error

Posted on 2024-09-05 04:27:27


I have ascii strings which contain the character "\x80" to represent the euro symbol:

>>> print "\x80"
€

When inserting string data containing this character into my database, I get:

psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

I'm a Unicode newbie. How can I convert my strings containing "\x80" to valid UTF-8 containing that same euro symbol? I've tried calling .encode and .decode on various strings, but I run into errors:

>>> "\x80".encode("utf-8")
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    "\x80".encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)


路弥 2024-09-12 04:27:27


The question starts with a false premise:

I have ascii strings which contain the character "\x80" to represent the euro symbol.

ASCII characters are in the range "\x00" to "\x7F" inclusive.

The previously accepted, now deleted answer operated under two gross misapprehensions: (1) that locale == encoding, and (2) that the latin1 encoding maps "\x80" to a Euro character.

In fact, all of the ISO-8859-x encodings map "\x80" to U+0080 which is one of the C1 control characters, not a Euro character. Only 3 of those encodings (x in (7, 15, 16)) provide the Euro character, as "\xA4". See this Wikipedia article.
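The claim above is easy to check from Python itself. This is only a small sketch: it uses `b"..."` literals so that it runs on Python 2.6+ as well as Python 3, and it samples four of the ISO-8859-x codecs rather than all of them.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Decode the byte 0x80 with a few ISO-8859-x codecs (a sample, not the full family).
decoded_0x80 = dict(
    (x, b"\x80".decode("iso-8859-%d" % x)) for x in (1, 7, 15, 16)
)

# Every one of them yields U+0080, whose Unicode category is "Cc" (control).
all_c1_control = all(
    ch == u"\u0080" and unicodedata.category(ch) == "Cc"
    for ch in decoded_0x80.values()
)

# The three ISO-8859 variants that do have a euro sign keep it at byte 0xA4.
euro_at_a4 = [
    x for x in (7, 15, 16)
    if b"\xa4".decode("iso-8859-%d" % x) == u"\u20ac"
]

print(all_c1_control)
print(euro_at_a4)
```

So decoding the data as latin1 "works" in the sense that it never raises, but what you end up storing is a control character, not a euro sign.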

You need to know what encoding your data is in. What machine was it created on? How? The locale it was created in (not necessarily yours) may give you a clue.

Note that "My data is encoded in latin1" is up there with "The cheque's in the mail" and "Of course I'll love you in the morning". Your data is probably encoded in one of the cp125x encodings found on Windows platforms. Note that all of them except cp1251 (Windows Cyrillic) map "\x80" to the euro character:

>>> ['\x80'.decode('cp125' + str(x), 'replace') for x in range(9)]
[u'\u20ac', u'\u0402', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac']

Update in response to the OP's comment

I'm reading this data from a file, e.g. open(fname).read(). It contains strings with \x80 in them that represents the euro character. It's just a plain text file. It is generated by another program, but I don't know how it goes about generating the text. What would be a good solution? I'm thinking I can assume that it outputs "\x80" for a euro character, meaning I can assume it's encoded with a cp125x that has that char as the euro.

This is a bit confusing: First you say

It contains strings with \x80 in them that represents the euro character

But later you say

I'm thinking I can assume that it outputs "\x80" for a euro character

Please explain.

Selecting an appropriate cp125x encoding: Where (geographical location) was the file created? In what language(s) is the text written? Are there any characters other than the presumed euro with values > "\x7f"? If so, which ones, and in what context are they used?
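To answer the last two questions you can simply look at the raw bytes. Here is a rough sketch; `high_byte_report` and the `sample` data are made up for illustration, and in practice you would feed it the result of `open(fname, "rb").read()`.

```python
from collections import Counter


def high_byte_report(data, context=10):
    """Count bytes above 0x7F and keep a bit of context for each distinct one."""
    data = bytearray(data)  # iterating a bytearray yields ints on Python 2 and 3
    counts = Counter()
    samples = {}
    for pos, byte in enumerate(data):
        if byte > 0x7F:
            counts[byte] += 1
            # Remember the surroundings of the first occurrence only.
            start = max(0, pos - context)
            samples.setdefault(byte, bytes(data[start:pos + context + 1]))
    return counts, samples


# Hypothetical sample data standing in for the real file contents.
sample = b"Price: \x80 10, caf\xe9"
counts, samples = high_byte_report(sample)
for byte in sorted(counts):
    print("0x%02X seen %d time(s), e.g. in %r" % (byte, counts[byte], samples[byte]))
```

If the only high byte is 0x80 and it always sits next to amounts of money, that supports the euro theory. Other high bytes, and the words they appear in, are what will tell one cp125x encoding from another.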

Update 2: If you don't "know how the program is written", neither you nor we can form an opinion on whether it always uses "\x80" for the euro character. Although doing otherwise would be monumental silliness, it can't be ruled out.

If the text is in English, and/or was written in the USA, and/or was written on a Windows platform, then cp1252 is almost certainly the way to go, until you get evidence to the contrary. In that case you would need to guess an encoding yourself or answer the questions above (what language, what locality).
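Assuming cp1252 does turn out to be right, the conversion itself is a decode followed by an encode. A minimal sketch (the helper name `cp1252_to_utf8` is mine, and the `b"..."` literals are there so it runs on Python 2.6+ and Python 3 alike):

```python
# -*- coding: utf-8 -*-
def cp1252_to_utf8(raw):
    """Decode cp1252 bytes to Unicode text, then encode that text as UTF-8."""
    text = raw.decode("cp1252")  # strict: raises on bytes cp1252 leaves undefined
    return text.encode("utf-8")


raw = b"Total: \x80 9.99"
utf8 = cp1252_to_utf8(raw)
print(repr(utf8))

# Bytes such as 0x81 have no meaning in Python's cp1252 codec, so a strict
# decode fails loudly instead of silently storing garbage.
try:
    b"\x81".decode("cp1252")
    undefined_byte_rejected = False
except UnicodeDecodeError:
    undefined_byte_rejected = True
```

This also explains the traceback in the question: on Python 2, calling `.encode("utf-8")` on a byte string first decodes it implicitly as ASCII, and that is the step that fails on 0x80. Keeping the decode strict is deliberate, since a `UnicodeDecodeError` here is exactly the "evidence to the contrary" mentioned above.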
