Python csv: UnicodeDecodeError

Posted on 2024-09-14 10:16:22

I'm reading in a file with Python's csv module, and have Yet Another Encoding Question (sorry, there are so many on here).

In the CSV file, there are £ signs. After reading the row in and printing it, they have become \xa3.

Trying to encode them as Unicode produces a UnicodeDecodeError:

row = [unicode(x.strip()) for x in row]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)

I have been reading the csv documentation and the numerous other questions about this on StackOverflow. I think that £ becoming \xa3 in ASCII means that the original CSV file is in UTF-8.

(Incidentally, is there a quick way to check the encoding of a CSV file?)

If it's in UTF-8, then shouldn't the csv module be able to cope with it? It seems to be transforming all the symbols into ASCII, even though the documentation claims it accepts UTF-8.

I've tried adding a unicode_csv_reader function as described in the csv examples, but it doesn't help.

---- EDIT -----

I should clarify one thing. I have seen this question, which looks very similar. But adding the unicode_csv_reader function defined there produces a different error instead:

yield [unicode(cell, 'utf-8') for cell in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected code byte

So maybe my file isn't UTF-8 after all? How can I tell?

Comments (2)

蝶…霜飞 2024-09-21 10:16:22

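Try using "ISO-8859-1" (Latin-1) as your encoding. It looks like the file is in a single-byte "extended ASCII" encoding, not UTF-8: in UTF-8 a £ sign is the two bytes \xc2\xa3, whereas a lone \xa3 is £ in Latin-1.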

Edit:

Here's some simple code that deals with extended ASCII:

>>> s = "La Pe\xf1a"
>>> print s
La Pe±a
>>> print s.decode("latin-1")
La Peña
>>>

Even better, dealing with the exact character that is giving you problems:

>>> s = "12\xa3"
>>> print s.decode("latin-1")
12£
>>>
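
For readers on Python 3, where the csv module works on text rather than bytes, the same idea might look roughly like the sketch below. It assumes the data really is Latin-1; the sample bytes and the `read_rows` helper are made up for illustration and do not come from the question.

```python
import csv
import io


def read_rows(raw_bytes, encoding="latin-1"):
    """Decode raw CSV bytes with the given encoding and parse them into rows."""
    text = raw_bytes.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [[cell.strip() for cell in row] for row in reader]


# Hypothetical sample: the pound sign stored as the single byte 0xA3.
sample = b"item,price\r\nwidget, \xa3 12\r\n"

rows = read_rows(sample)
print(rows)

# The same bytes are not valid UTF-8, because 0xA3 alone is a continuation byte.
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)
```

With a file on disk, passing `encoding="latin-1"` and `newline=""` to `open()` and handing the file object to `csv.reader` should have the same effect.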
乄_柒ぐ汐 2024-09-21 10:16:22

If you are on Windows, it is highly likely that the encoding that you should use is one of the cp125X family ... e.g. if you are in Western Europe or the Americas, it will be cp1252. Windows software often uses bytes in the range \x80 to \x9F inclusive to encode fancy punctuation characters whereas that range is reserved in ISO-8859-X for the rarely used "C1 Control Characters".
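
A small Python 3 sketch of that difference, using made-up sample bytes: 0x93 and 0x94 come out as curly quotes under cp1252 but as C1 control characters under Latin-1, while the pound sign (0xA3) decodes the same way in both.

```python
# Hypothetical sample: curly quotes as Windows software would typically write them.
raw = b"\x93quoted\x94 \xa3 5"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# repr() makes the invisible control characters in the Latin-1 result visible.
print(repr(as_cp1252))
print(repr(as_latin1))

# Both encodings agree on 0xA3; they differ only in the 0x80-0x9F range.
print(as_cp1252[-3:] == as_latin1[-3:])
```

If the file contains only characters such as £, either encoding gives the same text, so the choice matters mainly when "smart" punctuation is present.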

You can find out the usual encoding in your locale by running this at the command line:

python -c "import locale; print(locale.getpreferredencoding())"
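
As for telling what encoding a file is in, there is no fully reliable test, but trial decoding can rule candidates out. The Python 3 sketch below uses a hypothetical helper, `first_working_encoding`, with an assumed candidate list. A successful decode only shows that the bytes are valid in that encoding, not that it is the right one, and Latin-1 accepts any byte sequence, so it belongs last.

```python
def first_working_encoding(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding; try the next
        return name
    return None


# A lone 0xA3 byte is rejected by UTF-8 but accepted by cp1252.
print(first_working_encoding(b"12\xa3"))

# A pound sign that really was written as UTF-8 (bytes 0xC2 0xA3) passes at once.
print(first_working_encoding("12\u00a3".encode("utf-8")))
```

The value reported by `locale.getpreferredencoding()` could be placed in the candidate list after "utf-8" to try the machine's usual encoding before the fallbacks.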