Confusion about ANSI, ASCII, Unicode and encodings in Python
I was happily using BeautifulSoup, and I'm also using a text file to supply the input parameters of my Python script.
I then came across the famous "UnicodeEncodeError".
I've been reading questions here on SO but I'm still confused.
What does ASCII have to do with all of this?
What encoding should I use in my text editor (Notepad++)? ANSI? UTF-8?
Decoding a string to ASCII doesn't always seem to work (I'm guessing the string coming from BeautifulSoup is in a different encoding). How do I fix this?
Anyway, any help and clarification will be greatly appreciated.
Thanks!
Edit: reading BeautifulSoup's docs, it says that it only uses unicode, but I'm still getting Unicode errors :(
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u300d' in position 3: character maps to <undefined>
Comments (3)
ANSI is not a character encoding (in common parlance it refers to certain escape sequences, though it is of course the acronym for the American National Standards Institute). In Notepad++'s Encoding menu, "ANSI" loosely means the system's legacy Windows code page, such as Windows-1252. You can set the encoding in Notepad++ (and check which encoding you're using), hopefully to utf-8, because that's a universal encoding (it lets you represent any Unicode code point). You build unicode from your utf-8 encoded text with an explicit decode method call, or you read the file as unicode with codecs.open (both require you to specify the encoding name; again, hopefully 'utf8').
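A minimal sketch of the two approaches mentioned in the first answer, an explicit decode call and codecs.open, may help here. The file name input.txt and the sample text are made up for illustration, and the file is written to a temporary directory so the snippet runs on its own.

```python
import codecs
import os
import tempfile

# Hypothetical sample text containing non-ASCII characters (U+00E9 and U+300D).
text = u"caf\u00e9 \u300d"

# Write the text as UTF-8 bytes, roughly what an editor set to
# "UTF-8 without BOM" would store on disk.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Approach 1: read the raw bytes, then decode them explicitly.
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8")

# Approach 2: let codecs.open do the decoding while reading.
with codecs.open(path, "r", encoding="utf-8") as f:
    from_codecs = f.read()

# Both approaches yield the same unicode text.
assert decoded == from_codecs == text
```

Either way the encoding name has to match what the editor actually saved. In Python 3 the built-in open(path, encoding="utf-8") plays the same role as codecs.open.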
Python has no way to find out what encoding was used to store text, so it (Python 2 here) assumes ascii by default. However, ASCII defines only the first 128 characters, so anything outside that range results in a decode error (which is actually a good thing, since it stops you from passing incorrectly decoded strings around).
Most of the time your string will be in utf-8, since it's the most common way to encode Unicode, so it's usually safe to call s.decode('utf-8') on str type strings (or to use the unicode(s, 'utf-8') call).
If you don't know in advance what encoding the text has, and it provides no encoding metadata, you can try the chardet module.
BeautifulSoup can output its result in different encodings and ways, so you just need to specify that you want unicode there.
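To make the decode and encode directions concrete, here is a small sketch built around the character from the question's traceback (U+300D). The traceback points at cp437.py, which suggests the failure happens when the text is encoded for output, probably for the Windows console, but that is an inference rather than something the question states. The variable names are made up for illustration.

```python
# Hypothetical input: the UTF-8 bytes for U+300D, the character in the traceback.
raw = u"\u300d".encode("utf-8")

# Decoding as ASCII fails, because every byte is outside the 0-127 range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("utf-8")

# Encoding to cp437 fails: that code page has no slot for U+300D.
try:
    text.encode("cp437")
    cp437_ok = True
except UnicodeEncodeError:
    cp437_ok = False

# Two possible workarounds: substitute unmappable characters,
# or write UTF-8 bytes to a file instead of the console.
replaced = text.encode("cp437", "replace")
utf8_bytes = text.encode("utf-8")
```

The first failure is a decode error on input and the second is an encode error on output; the question's UnicodeEncodeError is of the second kind, so decoding the input correctly is necessary but does not by itself fix printing to a cp437 console.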
As of now (January 23, 2014), for Notepad++ (NPP) there still seem to be a lot of recent or unresolved bug reports and discussions regarding the use of "ANSI" as a Notepad++ encoding term.
PROBLEM
Google: notepad++ ansi encoding
Results:
#4095 "ANSI as UTF-8" Misleading
#124 ansi encoding and german letters
The encoding that Notepad++ just calls “ANSI”, does anyone know what to call it for Ruby?
Notepad++ Forum - Search discussion: ANSI encoding
SOLUTION
The following NPP forum discussion seems to point to the best solution for me.
See "Encoding detection, ANSI (Windows 1252) vs. UTF-8 (w/o BOM)".
I CHECKED the above option, as opposed to the author, who UNchecked it.
Then I begin my Python script as follows.