Python Unicode CSV export (using Django)
I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about Unicode; however, I'm still not able to convert this Unicode string to ASCII. I don't care if the algorithm removes the Unicode characters. The commented-out lines show the various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
dmessage,
])
Does anyone have any advice on removing Unicode characters so I can export the messages to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe
Comments (3)
You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you, if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead, Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like ’ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters, you can use the ignore argument (or replace, to replace them with question marks).
Alternatively, you can give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
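A minimal sketch of both options, assuming Python 3 (where the csv module writes text and the encoding is chosen on the underlying file rather than per string). The helper name csv_bytes and its parameters are hypothetical, made up for illustration; they are not part of Django or the csv module.

```python
import csv
import io


def csv_bytes(rows, encoding="cp1252", errors="replace"):
    """Serialise rows to CSV and return the encoded bytes.

    Hypothetical helper for illustration only.
    """
    raw = io.BytesIO()
    # newline="" leaves line endings to the csv module.
    text = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="")
    csv.writer(text).writerows(rows)
    text.flush()
    data = raw.getvalue()
    text.detach()  # stop the wrapper from closing `raw` when it is collected
    return data


# Sample message: a right single quote plus a character cp1252 cannot hold.
message = "It\u2019s \u4e2d"

# Code page 1252 with errors="replace": the quote is kept, the CJK
# character is replaced with a question mark.
cp1252_row = csv_bytes([[message]])

# ASCII with errors="ignore": every non-ASCII character is dropped.
ascii_row = csv_bytes([[message]], encoding="ascii", errors="ignore")
```

On Python 2, as used in the question, the rough equivalent would be to encode each value before handing it to the writer, for example contact.message.encode('cp1252', 'replace') or contact.message.encode('ascii', 'ignore').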
Encoding is a pain, but if you're working in Django, have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in Python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.
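For reference, a small sketch of the explicit encode() and decode() route, assuming Python 3 (where the text type is str rather than unicode). The sample message is made up for illustration.

```python
# Sample text containing U+2019 (right single quotation mark).
message = "It\u2019s a test"

# Text -> bytes: the encoding has to be named explicitly.
utf8_bytes = message.encode("utf-8")

# Bytes -> text: decode with the same encoding to get the original back.
round_tripped = utf8_bytes.decode("utf-8")

# Encoding to ASCII fails for this text, which is the error from the question.
try:
    message.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc.reason
```

The point of the round trip is that bytes only mean something together with the encoding that produced them, which is why both calls need it spelled out.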
[Caveat: I'm not a djangoist; Django may have a better solution.]
General non-Django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
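A minimal sketch of the translation-table idea, assuming Python 3, where str.translate plays the role of unicode.translate. The table contents and the to_ascii helper are hypothetical choices for illustration; pick whatever replacements your users find acceptable.

```python
# Map code points of known non-ASCII characters to ASCII stand-ins.
ASCII_EQUIVALENTS = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2026: "...",  # horizontal ellipsis
}


def to_ascii(text, table=ASCII_EQUIVALENTS):
    """Replace known characters via the table, then mark the rest with '?'."""
    translated = text.translate(table)
    # Anything not covered by the table is still non-ASCII; replace it
    # rather than raising UnicodeEncodeError.
    return translated.encode("ascii", "replace").decode("ascii")


cleaned = to_ascii("It\u2019s \u201cfine\u201d\u2026")
fallback = to_ascii("caf\u00e9")
```

Characters in the table come out as readable ASCII; anything else degrades to a question mark instead of aborting the export.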