ANSI, ASCII, Unicode, and encoding confusion with Python

Posted on 2024-09-11 02:05:52

I was happily using BeautifulSoup and I'm also using a text file as input parameters of my Python script.

I then came across the famous "UnicodeEncodeError" error.

I've been reading questions here at SO but I'm still confused.

What does ASCII have to do with all of this?
What encoding do I use on my text editor (Notepad++)? ANSI? UTF-8?
Decoding a string to ASCII doesn't seem to always work (I'm guessing the string is in a different encoding coming from BeautifulSoup). How do I fix this?

Anyway any help and clarifications will be greatly appreciated.

Thanks!

Edit: reading BeautifulSoup's docs, it says that it only uses unicode, but I'm still getting Unicode errors :(

  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u300d' in position 3: character maps to <undefined>

难如初 2024-09-18 02:05:52

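ANSI is not a character encoding (it normally refers to certain escape sequences, though it is of course also the acronym for the American National Standards Institute). You can set your encoding in Notepad++ (and check which encoding you are using). UTF-8 is the best choice, since it is a universal encoding (it lets you represent any Unicode code point). You can build unicode from UTF-8 encoded text with an explicit decode method call, or use codecs.open to read the file as unicode (both require you to specify the name of your encoding, hopefully 'utf8' again).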
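As a minimal sketch of both approaches (written so that it also runs on Python 3, where the unicode type is simply str; the file name params.txt and its contents are made up for illustration):

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical input file, written here as UTF-8 bytes so the example
# is self-contained.
path = os.path.join(tempfile.mkdtemp(), "params.txt")
raw = u"caf\u00e9 \u300d".encode("utf-8")
with open(path, "wb") as f:
    f.write(raw)

# Approach 1: read the raw bytes, then decode them explicitly.
with open(path, "rb") as f:
    text_from_decode = f.read().decode("utf-8")

# Approach 2: let codecs.open do the decoding while reading.
with codecs.open(path, "r", encoding="utf-8") as f:
    text_from_codecs = f.read()

# Both give the same text; the byte count and character count differ
# because the non-ASCII characters take more than one byte in UTF-8.
print(text_from_decode == text_from_codecs)
print(len(raw), len(text_from_decode))
```

Either way, the encoding name has to match what the editor actually saved; if the file was saved in a different encoding, the decode step is where the mismatch will surface.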

为你拒绝所有暧昧 2024-09-18 02:05:52

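What does ASCII have to do with all of this?

Python (2.x, as in the question) has no way of knowing which encoding your text is stored in, so it assumes ascii by default. ASCII defines only the first 128 characters, though, so anything beyond that range causes a decoding error (which is actually a good thing, because it stops you from working with a wrongly decoded string).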
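Both failure modes can be reproduced in a few lines. This is a sketch rather than the asker's actual code: the byte string is made up, and cp437 is taken from the traceback in the question, where it appears to be the console's code page.

```python
# -*- coding: utf-8 -*-

# "café" encoded as UTF-8; the bytes \xc3 and \xa9 are outside ASCII.
utf8_bytes = b"caf\xc3\xa9"

# Decoding (bytes -> text): ASCII covers only byte values 0-127,
# so this attempt fails.
try:
    utf8_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# With the matching codec the same bytes decode cleanly.
text = utf8_bytes.decode("utf-8")

# Encoding (text -> bytes): cp437 has no slot for U+300D, which is the
# character named in the traceback.
corner = u"\u300d"
try:
    corner.encode("cp437")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# One way to avoid the crash is to substitute unencodable characters.
fallback = corner.encode("cp437", "replace")

print(decode_failed, encode_failed, fallback)
```

Note that the error in the question is an encode error, not a decode error: the text was already unicode, and the failure happened when it was converted to the output encoding.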

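Most of the time your strings will be in UTF-8, since it is the most common way to encode Unicode, so it is usually safe to call s.decode('utf-8') on str-typed strings (or to use unicode(s, 'utf-8')).

If you do not know the encoding of the text in advance and it comes with no encoding metadata, you can try the chardet module.

BeautifulSoup can output its results in different encodings and forms, so you just need to specify that you want unicode.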

冷心人i 2024-09-18 02:05:52

As of now (January 23, 2014), there still seem to be a lot of recent or unresolved bug reports and discussions about the use of ANSI as an encoding term in Notepad++ (NPP).

PROBLEM

Google: notepad++ ansi encoding

Results:

#4095 "ANSI as UTF-8" Misleading

#124 ansi encoding and german letters

The encoding that Notepad++ just calls “ANSI”, does anyone know what to call it for Ruby?

Notepad++ Forum - Search discussion: ANSI encoding

SOLUTION

The following NPP Forum Discussion seems to point to the best SOLUTION for me.

See Encoding detection, ANSI (Windows 1252) vs. UTF-8 (w/o BOM)

Under Preferences -> New Document > Encoding > "UTF8 without BOM", there is an option called "Apply to opened ANSI files".

I checked that option, unlike the author of that discussion, who unchecked it.

Then I begin my Python script as follows.

#!/usr/bin/python
# -*- coding: utf-8 -*-
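For context on the "without BOM" setting: a BOM is a three-byte marker that some editors place at the start of a UTF-8 file. Here is a small sketch of how it shows up on the Python side. The sample text is made up; utf-8-sig is the standard-library codec that skips a leading BOM when one is present.

```python
# -*- coding: utf-8 -*-
import codecs

body = u"name=\u300d".encode("utf-8")
with_bom = codecs.BOM_UTF8 + body   # what a "UTF-8 with BOM" save would contain
without_bom = body                  # what a "UTF-8 without BOM" save would contain

# Plain utf-8 keeps the BOM as a U+FEFF character at the start of the text.
plain = with_bom.decode("utf-8")

# utf-8-sig drops a leading BOM and is harmless when there is none.
stripped = with_bom.decode("utf-8-sig")
unchanged = without_bom.decode("utf-8-sig")

print(plain.startswith(u"\ufeff"), stripped == unchanged)
```

Saving without a BOM sidesteps the stray U+FEFF entirely, which is one reason that setting is commonly preferred for script and data files.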


About the author: 梦幻的心爱
