Python text encoding

Posted 2024-10-07 12:53:42

I have this text in a file: Recuérdame (note that it is a French word). When I read the file with a Python script, I get the text as Recu&#233;rdame.

I read it in as a unicode string. Do I need to find out what the encoding of the text is and decode it? Or is my terminal playing tricks on me?

蔚蓝源自深海 2024-10-14 12:53:42

Yes, you need to know the encoding of the text file to turn it into a unicode string (from the bytes that make up the file).

For example, if you know the encoding is UTF-8:

with open('foo.txt', 'rb') as f:
    contents = f.read().decode('utf-8-sig')   # -sig takes care of BOM if present

The text in your file does not seem to be stored as encoded Unicode characters, however; the accented character is apparently stored as an XML entity, which will have to be converted manually (tip of the hat to jleedev for the link).
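
For readers on Python 3, the same read can probably be written by passing the encoding straight to open(). The following is only a sketch under that assumption: the helper name read_text and the temporary demo file are made up for illustration, and ascii() is used so the result does not depend on what the terminal can display.

```python
import os
import tempfile


def read_text(path, encoding="utf-8-sig"):
    """Read a whole file as text, decoding it with the given encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Demo: write UTF-8 bytes preceded by a BOM, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "foo.txt")  # hypothetical file name
    with open(demo_path, "wb") as f:
        f.write(b"\xef\xbb\xbfRecu\xc3\xa9rdame")
    demo_text = read_text(demo_path)

# ascii() escapes non-ASCII characters, so the output is terminal-independent.
print(ascii(demo_text))
```

The "utf-8-sig" codec drops the BOM on decoding, so the returned string starts directly with "R".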

独闯女儿国 2024-10-14 12:53:42

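It is not a Unicode string. It is a string in whatever encoding it was encoded in, so it is a UTF-8 or Latin-1 or some other string. In this case, &#233; is specifically the HTML/XML entity that represents é. It is an encoding used in HTML and XML to encode non-ASCII data.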
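
One way such text can end up in a file is an ASCII encode with the "xmlcharrefreplace" error handler. This is only an illustration of the mechanism, written for Python 3; how the file in the question was actually produced is not known.

```python
text = "Recu\u00e9rdame"  # "Recuérdame"; é is U+00E9, decimal 233

# Characters that ASCII cannot represent are replaced by &#NNN; references.
escaped = text.encode("ascii", "xmlcharrefreplace")

print(escaped)  # b'Recu&#233;rdame'
```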

To decode it to Unicode, see Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html
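
On Python 3.4 or later, the standard library's html.unescape should do the same job without a hand-written regular expression. A minimal sketch, assuming that Python version (this is not the method from the linked page):

```python
import html

raw = "Recu&#233;rdame"
decoded = html.unescape(raw)

# ascii() escapes non-ASCII characters, so the output is terminal-independent.
print(ascii(decoded))  # 'Recu\xe9rdame'

# Hexadecimal and named references decode to the same character.
same_char = html.unescape("&#xE9;") == html.unescape("&eacute;") == "\u00e9"
print(same_char)  # True
```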

情徒 2024-10-14 12:53:42

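It is HTML, and this construct is called an "entity". You can use

import re

def entity_decode(match):
    # Groups: the whole entity, an optional "x" marking hex, and the digits.
    _, is_hex, entity = match.groups()
    base = 16 if is_hex else 10
    return chr(int(entity, base))  # Python 3; use unichr() on Python 2

print(re.sub(r"(?i)(&#(x?)([^;]+);)",
             entity_decode,
             "Recu&#233;rdame"))

to decode all numeric entities (the &#NNN; and &#xHH; forms).

Edit: Yes, they are of course not Latin-1; it should now work with all such entities.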

习惯成性 2024-10-14 12:53:42

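Working with xlrd, I had the line
...xl_data.find(str(cell_value))...
which gave the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)".
None of the suggestions on the forums helped with my German words.
But changing it to:
...xl_data.find(cell.value)...
gave no error.
So I suppose there are specific encoding problems when strings are passed as arguments to certain xlrd commands.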
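
The u'\xdf' in the error message suggests the failure may come from str() rather than from xlrd itself: on Python 2, calling str() on a unicode value implicitly encodes it as ASCII. The sketch below reproduces the same kind of error on Python 3 with an explicit encode; the sample word and sentence are assumptions chosen for illustration, not data from the post.

```python
word = "wei\u00df"  # "weiß"; sample word with ß (U+00DF) at position 3
haystack = "Das Haus ist wei\u00df."  # sample text to search in

# Forcing the text through ASCII fails on the first non-ASCII character.
try:
    word.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)

# Passing the text through unchanged avoids the implicit encode entirely.
index = haystack.find(word)
print(index)
```

This matches the fix described above: handing the cell value to find() as it is, instead of wrapping it in str(), skips the ASCII step.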