ANSI, ASCII, Unicode and encoding confusion with Python

Posted on 2024-09-11 02:05:52

I was happily using BeautifulSoup, and I'm also using a text file to supply the input parameters for my Python script.

I then came across the famous "UnicodeEncodeError" error.

I've been reading questions here at SO but I'm still confused.

What does ASCII have to do with all of this?
What encoding do I use on my text editor (Notepad++)? ANSI? UTF-8?
Decoding a string to ASCII doesn't seem to always work (I'm guessing the string is in a different encoding coming from BeautifulSoup). How do I fix this?

Anyway, any help and clarification would be greatly appreciated.

Thanks!

Edit: Reading BeautifulSoup's docs, it says that it only uses Unicode, but I'm still getting Unicode errors :(

  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u300d' in position 3: character maps to <undefined>


Comments (3)

难如初 2024-09-18 02:05:52

ANSI is not a character encoding (in common parlance it refers to certain escape sequences, though it's of course the acronym for the American National Standards Institute). You can set the encoding in Notepad++ (and check which encoding you're using); ideally it is UTF-8, because that's a universal encoding (it lets you represent any Unicode code point). You build a unicode object from your UTF-8 encoded text with an explicit decode method call, or you read the file as unicode with codecs.open (both require you to specify the encoding name; again, hopefully 'utf8').
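
Both approaches might look roughly like the sketch below. It is written for Python 3, which is an assumption: the question uses Python 2.6, where the decoded type is called unicode rather than str. The temporary file and the sample text are made up for illustration.

```python
import codecs
import os
import tempfile

# Sample text containing U+300D, the character named in the question's traceback.
text = u"quote\u300d"

# Write the text out as UTF-8 bytes so there is something to read back.
# The path is a throwaway temporary file, not anything from the question.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(text.encode("utf-8"))

# Approach 1: read raw bytes, then decode them explicitly.
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8")

# Approach 2: let codecs.open do the decoding while reading.
with codecs.open(path, "r", encoding="utf-8") as f:
    from_codecs = f.read()

os.remove(path)
print(decoded == from_codecs == text)
```

Either way, the encoding name is stated once, at the point where bytes enter the program, and everything after that works with decoded text.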

为你拒绝所有暧昧 2024-09-18 02:05:52

What does ASCII have to do with all of this?

Python (2.x, as used in the question) has no way to find out which encoding was used to store a piece of text, so it assumes ASCII by default. However, ASCII defines only the first 128 characters, so any byte outside that range raises a decode error (which is actually a good thing, since it stops you from passing incorrectly decoded strings around).
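
A small sketch of that failure, assuming Python 3 syntax (bytes and str instead of Python 2's str and unicode); the sample word is only an illustration:

```python
# UTF-8 bytes for "café"; the last character needs two bytes above 0x7F.
data = u"caf\u00e9".encode("utf-8")

# Pure ASCII bytes decode without trouble.
ascii_ok = b"cafe".decode("ascii")

# Bytes above 0x7F are outside ASCII, so a strict ASCII decode raises.
try:
    data.decode("ascii")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    bad_position = exc.start  # index of the first offending byte

# Decoding with the encoding the bytes were actually written in works.
decoded = data.decode("utf-8")
```

The error points at the first byte that ASCII cannot represent, which is usually a good hint that the data is in some other encoding.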

Most of the time your string will be in UTF-8, since it's the most common way to encode Unicode, so it's usually safe to call s.decode('utf-8') on str-type strings (or use unicode(s, 'utf-8')).

If you don't know in advance which encoding the text uses, and it carries no encoding metadata, you can try the chardet module.
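
chardet itself is a third-party package that guesses statistically. As a much cruder stand-in that needs only the standard library, one could try a short list of candidate encodings in order. The sketch below assumes Python 3; the helper name guess_decode and the candidate list are hypothetical, not part of chardet or of the answer.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration; this is trial decoding,
    not the statistical detection that chardet performs.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


utf8_bytes = u"caf\u00e9".encode("utf-8")
cp1252_bytes = u"caf\u00e9".encode("cp1252")

text_a, enc_a = guess_decode(utf8_bytes)
text_b, enc_b = guess_decode(cp1252_bytes)
print(enc_a, enc_b)
```

The order matters: UTF-8 is strict and rejects most non-UTF-8 input, whereas cp1252 and latin-1 accept almost any byte sequence and would silently turn UTF-8 data into mojibake if tried first.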

BeautifulSoup can output its results in different encodings and forms, so you just need to specify that you want Unicode there.
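
Even with Unicode text in hand, writing it out can still fail. The question's traceback goes through encodings\cp437.py, which suggests the text was being encoded for a console using code page 437; that reading is an inference from the traceback, not something the question states. A minimal sketch of the failure and two possible ways around it, assuming Python 3 and a made-up sample string:

```python
# U+300D is the character named in the question's traceback.
text = u"done\u300d"

# cp437 has no mapping for U+300D, so a strict encode fails.
try:
    text.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Option 1: substitute unencodable characters with "?".
replaced = text.encode("cp437", errors="replace")

# Option 2: produce UTF-8 bytes instead, e.g. for writing to a file.
as_utf8 = text.encode("utf-8")
```

Option 1 loses the character but keeps the output readable on a limited console; option 2 keeps everything but needs a destination that understands UTF-8.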

冷心人i 2024-09-18 02:05:52

As of now (2014-01-23), there still seem to be many recent or unresolved bug reports and discussions about Notepad++ (NPP) using "ANSI" as the name of an encoding.

PROBLEM

Google: notepad++ ansi encoding

Results:

#4095 "ANSI as UTF-8" Misleading

#124 ansi encoding and german letters

The encoding that Notepad++ just calls “ANSI”, does anyone know what to call it for Ruby?

Notepad++ Forum - Search discussion: ANSI encoding

SOLUTION

The following NPP forum discussion pointed me to what seems to be the best solution.

See Encoding detection, ANSI (Windows 1252) vs. UTF-8 (w/o BOM)

Under Preferences -> New Document -> Encoding, the "UTF-8 without BOM" option has a checkbox called "Apply to opened ANSI files".

I checked this option, unlike the author of that post, who unchecked it.

Then I begin my Python script as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-