Python: Convert Unicode to ASCII without errors for a CSV file

Posted 2024-10-11 00:29:25

I've been reading all the questions about converting from Unicode to CSV in Python here on Stack Overflow and I'm still lost. Every time, I get "UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 12: ordinal not in range(128)".

buffer=cStringIO.StringIO()
writer=csv.writer(buffer, csv.excel)
cr.execute(query, query_param)
while (1):
    row = cr.fetchone()
    writer.writerow([s.encode('ascii','ignore') for s in row])

The value of row is

(56, u"LIMPIADOR BA\xd1O 1'5 L")

where \xd1 in the database value is Ñ, an N with a tilde, used in Spanish. At first I tried to convert the value to something valid in ASCII, but after losing so much time I'm now only trying to ignore those characters (I suppose I'd have the same problem with accented vowels).

I'd like to save the value to the CSV, preferably with the ñ ("LIMPIADOR BAÑO 1'5 L"), but if not possible, at least be able to save it ("LIMPIADOR BAO 1'5 L").

奈何桥上唱咆哮 2024-10-18 00:29:25

Correct, ñ is not a valid ASCII character, so you can't encode it to ASCII. You can ignore such characters, as your code above does. Another option is to remove the accents, as described here:
What is the best way to remove accents in a Python unicode string?
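As a rough illustration of the accent-removal approach, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2), and the helper name strip_accents is made up for this example. It decomposes each character with unicodedata.normalize and drops the combining marks:

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFKD splits 'Ñ' into a plain 'N' followed by a combining tilde.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("LIMPIADOR BA\xd1O 1'5 L"))  # prints: LIMPIADOR BANO 1'5 L
```

The result is pure ASCII, but "BANO" is no longer the same word as "BAÑO", which is exactly the kind of side effect to watch out for.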

But note that both techniques can have bad effects, such as turning a word into one that means something different. So the best option is to keep the accents. Then you can't use ASCII, but you can use another encoding. UTF-8 is the safe bet. Latin-1 (ISO-8859-1) is a common one, but it includes only Western European characters. CP-1252 is common on Windows, and so on.

So just switch "ascii" for whatever encoding you want.
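To make the difference between the encodings concrete, here is a small sketch using the sample value from the question. It assumes Python 3, where a plain string literal is already Unicode; the variable names are only for illustration:

```python
value = "LIMPIADOR BA\xd1O 1'5 L"  # sample value from the question

# Strict ASCII encoding fails because Ñ (U+00D1) is outside range(128).
try:
    value.encode("ascii")
    strict_ascii_failed = False
except UnicodeEncodeError:
    strict_ascii_failed = True

ascii_ignored = value.encode("ascii", "ignore")  # silently drops the Ñ
utf8_bytes = value.encode("utf-8")               # Ñ becomes the two bytes C3 91
latin1_bytes = value.encode("latin-1")           # Ñ becomes the single byte D1
cp1252_bytes = value.encode("cp1252")            # same byte as Latin-1 for Ñ

print(strict_ascii_failed)  # prints: True
print(ascii_ignored)        # prints: b"LIMPIADOR BAO 1'5 L"
print(utf8_bytes)           # prints: b"LIMPIADOR BA\xc3\x91O 1'5 L"
print(latin1_bytes)         # prints: b"LIMPIADOR BA\xd1O 1'5 L"
```

Whichever encoding you pick, whatever reads the CSV later has to decode it with the same one.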

Your actual code, according to your comment, is:

writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])

where

row = (56, u"LIMPIADOR BA\xd1O 1'5 L")

Now, I believe that should work, but apparently it doesn't. I think unicode gets passed into the csv writer by mistake anyway. Unwrap that long line into its parts:

col1, col2 = row # Use the names of what is actually there instead
row = col1, col2.encode('utf8')
writer.writerow(row)

Now your real error will not be hidden by the fact that you put everything on one line. This could probably also have been avoided if you had included a proper traceback.
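For comparison, here is how the same job might look on Python 3. This is a different approach from the per-cell encoding above, not the asker's code: the csv module there works on text, so the rows can be written as they are and the whole buffer encoded once at the end. The function name rows_to_csv_bytes and the sample rows are assumptions for this sketch:

```python
import csv
import io


def rows_to_csv_bytes(rows, encoding="utf-8"):
    """Serialize rows to CSV text, then encode the result once (hypothetical helper)."""
    # newline="" keeps the csv module's own "\r\n" line endings untouched.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, dialect=csv.excel)
    for row in rows:
        # Mixed int/str rows are fine: csv.writer converts non-strings itself.
        writer.writerow(row)
    return buffer.getvalue().encode(encoding)


rows = [(56, "LIMPIADOR BA\xd1O 1'5 L")]  # stand-in for the database cursor rows
data = rows_to_csv_bytes(rows)
print(data)  # prints: b"56,LIMPIADOR BA\xc3\x91O 1'5 L\r\n"
```

Encoding in one place means a bad value fails at a single, obvious line instead of somewhere inside a list comprehension.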
