Python text encoding
I have this text in a file: Recuérdame (note that it is a French word). When I read the file with a Python script, I get the text as Recu&eacute;rdame.
I read it as a unicode string. Do I need to find out what the encoding of the text is and decode it, or is my terminal playing tricks on me?
Comments (4)
Yes, you need to know the encoding of the text file to turn it into a unicode string (from the bytes that make up the file).
For example, if you know the encoding is UTF-8, decode the file's bytes as UTF-8 when you read it.
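The answer's own snippet for the UTF-8 case did not survive in this copy of the thread. As a minimal sketch of the idea, not the answerer's code, the following writes a hypothetical sample file and reads it back two ways: decoding while reading, and decoding the raw bytes by hand. `io.open` is used because it accepts an `encoding` argument on both Python 2.6+ and Python 3; the file name and contents are assumptions made only for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"Recu\xe9rdame".encode("utf-8"))  # "Recuérdame" stored as UTF-8 bytes

# Option 1: let the file object decode the bytes while reading.
with io.open(path, encoding="utf-8") as f:
    text = f.read()

# Option 2: read raw bytes, then decode them explicitly.
with io.open(path, "rb") as f:
    raw = f.read()
manual = raw.decode("utf-8")

os.remove(path)
print(text == manual)
```

Both routes give the same unicode string. If the chosen encoding is wrong for the file, the decode step either raises `UnicodeDecodeError` or silently yields the wrong characters, which is why the encoding has to be known rather than guessed.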
However, the accented character in your file does not seem to be stored as an encoded Unicode character; it is apparently stored as an XML entity, which will have to be converted manually (tip of the hat to jleedev for the link).
It is not a Unicode string. It is a string in whatever encoding the file was written in, so it is a UTF-8, Latin-1, or some other encoded string. In this case, &eacute; is specifically an HTML/XML entity representing é. Entities are a notation used in HTML and XML to encode non-ASCII data. To decode it into Unicode, look at Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html
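On a current interpreter there is also a standard-library route. The sketch below assumes Python 3.4 or later, which this thread predates, and uses `html.unescape` to turn named and numeric character references into the characters they stand for. The sample strings are illustrative, not taken from the asker's file.

```python
import html

raw = "Recu&eacute;rdame"       # text as read from the file, entity still escaped
decoded = html.unescape(raw)    # named entity -> the character it represents

numeric = html.unescape("Recu&#233;rdame")  # decimal character reference, same result

print(decoded)
print(decoded == numeric)
```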
This is HTML, and the construct is called an "entity". You can decode all entities programmatically.
Edit: Yes, they are of course not Latin-1; it should now work with all entities.
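The code this answer referred to is missing from this copy, so what follows is only a sketch of one way entity decoding can work, not the answerer's original. It looks names up in the standard entity table (`html.entities.name2codepoint` on Python 3; the same table lives in `htmlentitydefs` on Python 2) and also handles numeric references. The function name `decode_entities` is hypothetical.

```python
import re
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import name2codepoint

# Matches &name;, &#123; and &#x7B; style references.
_ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")

def decode_entities(text):
    """Replace named and numeric HTML entities with the characters they stand for."""
    def replace(match):
        body = match.group(1)
        if body[:2].lower() == "#x":
            return chr(int(body[2:], 16))   # hexadecimal reference
        if body.startswith("#"):
            return chr(int(body[1:]))       # decimal reference
        codepoint = name2codepoint.get(body)
        # Unknown names are left untouched rather than guessed at.
        return chr(codepoint) if codepoint is not None else match.group(0)
    return _ENTITY.sub(replace, text)

print(decode_entities("Recu&eacute;rdame"))
```

The lookup goes through Unicode code points, which is why it is not limited to Latin-1 characters.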
Working with xlrd, I have this in one line:
...xl_data.find(str(cell_value))...
which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)".
None of the suggestions in the forums helped with my German words.
But changing it to:
...xl_data.find(cell.value)...
gives no error.
So I suppose that passing strings as arguments to certain xlrd calls causes specific encoding problems.
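The likely cause is not xlrd itself. On Python 2, calling `str()` on a unicode object implicitly encodes it with the default ASCII codec, and ß (u'\xdf') has no ASCII representation. The sketch below reproduces the same codec failure on Python 3 with an explicit `encode("ascii")`; the cell text and the searched string are hypothetical stand-ins for `cell_value` and `xl_data`.

```python
cell_value = "Gro\xdf"  # hypothetical cell text "Groß"; the character at index 3 is u'\xdf'

try:
    cell_value.encode("ascii")   # what str(unicode_value) did implicitly on Python 2
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start        # index of the first character ASCII cannot represent
    print(exc)

# Passing the unicode value through unchanged needs no encoding step at all.
sheet_text = "Klein, Gro\xdf, Mittel"  # hypothetical stand-in for xl_data
index = sheet_text.find(cell_value)
print(failed_at, index)
```

This is consistent with the observation above: dropping the `str()` call keeps the value as unicode, so nothing ever tries to squeeze it into ASCII.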