Python csv: UnicodeDecodeError
I'm reading in a file with Python's csv module, and have Yet Another Encoding Question (sorry, there are so many on here).
In the CSV file, there are £ signs. After reading the row in and printing it, they have become \xa3.
Trying to convert them to Unicode produces a UnicodeDecodeError:
row = [unicode(x.strip()) for x in row]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)
I have been reading the csv documentation and the numerous other questions about this on StackOverflow. I think that £ becoming \xa3 in ASCII means that the original CSV file is in UTF-8.
(Incidentally, is there a quick way to check the encoding of a CSV file?)
If it's in UTF-8, then shouldn't the csv module be able to cope with it? It seems to be transforming all the symbols into ASCII, even though the documentation claims it accepts UTF-8.
I've tried adding a unicode_csv_reader function as described in the csv examples, but it doesn't help.
---- EDIT -----
I should clarify one thing. I have seen this question, which looks very similar. But adding the unicode_csv_reader function defined there produces a different error instead:
yield [unicode(cell, 'utf-8') for cell in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected code byte
So maybe my file isn't UTF-8 after all? How can I tell?
Comments (2)
Try using "ISO-8859-1" as your encoding. It seems like you are dealing with extended ASCII, not Unicode.
Edit:
Here's some simple code that deals with extended ASCII:
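A minimal sketch of this idea, written for Python 3 (the question itself uses Python 2) and assuming the file really is ISO-8859-1. The sample bytes and the `read_rows` helper are hypothetical, made up for illustration:

```python
import csv
import io

# Hypothetical sample data: a CSV file whose pound sign is stored
# as the single byte 0xA3, as in the traceback from the question.
raw = b"item,price\r\nwidget,\xa3 9.99\r\n"

def read_rows(data, encoding="iso-8859-1"):
    """Decode raw bytes with the given encoding, then parse them as CSV."""
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""))
    return [[cell.strip() for cell in row] for row in reader]

rows = read_rows(raw)
# ascii() keeps the output printable on any terminal encoding.
print(ascii(rows[1][1]))  # '\xa3 9.99', i.e. U+00A3 POUND SIGN followed by " 9.99"
```

In Python 2 the equivalent change to the question's code would be to pass the encoding explicitly, for example `unicode(x.strip(), 'iso-8859-1')`, rather than relying on the default ASCII codec.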
Even better, dealing with the exact character that is giving you problems:
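One possible way to do that, as a Python 3 sketch. It assumes the data is plain ASCII apart from stray 0xA3 bytes; it would corrupt input that is already valid UTF-8, where the pound sign is the two bytes 0xC2 0xA3. The `fix_pound` name is hypothetical:

```python
POUND_BYTE = b"\xa3"                   # the byte reported in the traceback
POUND_UTF8 = "\u00a3".encode("utf-8")  # b"\xc2\xa3", the pound sign in UTF-8

def fix_pound(data):
    """Swap each lone 0xA3 byte for the UTF-8 pound sign, then decode.

    Assumption: every other byte in `data` is plain ASCII.
    """
    return data.replace(POUND_BYTE, POUND_UTF8).decode("utf-8")

price = fix_pound(b"\xa3 9.99")
print(ascii(price))  # '\xa3 9.99'
```

Because only the one known byte is rewritten, any other unexpected non-ASCII byte still raises a UnicodeDecodeError instead of being silently misread.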
If you are on Windows, it is highly likely that the encoding that you should use is one of the cp125X family ... e.g. if you are in Western Europe or the Americas, it will be cp1252. Windows software often uses bytes in the range \x80 to \x9F inclusive to encode fancy punctuation characters, whereas that range is reserved in ISO-8859-X for the rarely used "C1 Control Characters". You can find out the usual encoding in your locale by running this at the command line:
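A sketch of such a check using the standard `locale` module, shown in Python 3 syntax. The `usual_encoding` wrapper is a hypothetical name, and the second part illustrates the \x80 to \x9F difference described above:

```python
import locale

def usual_encoding():
    """Return the encoding the current locale prefers for text data."""
    return locale.getpreferredencoding()

print(usual_encoding())

# The same byte means different things in the two encodings:
curly = b"\x93".decode("cp1252")        # U+201C LEFT DOUBLE QUOTATION MARK
control = b"\x93".decode("iso-8859-1")  # U+0093, a C1 control character
print(ascii(curly), ascii(control))     # '\u201c' '\x93'
```

From a shell, the first part can be run as a one-liner: `python -c "import locale; print(locale.getpreferredencoding())"`. The result depends on the machine; on a Western European Windows setup it is typically cp1252.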