Is there a Python library function that attempts to guess the character encoding of some bytes?
I'm writing some mail-processing software in Python that is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding, but I'd like to get out a unicode string approximating the original one without throwing a UnicodeDecodeError.
So, I'm looking for a function that takes a str and optionally some hints and does its darndest to give me back a unicode. I could write one of course, but if such a function exists, its author has probably thought a bit deeper about the best way to go about this.
I also know that Python's design prefers explicit to implicit and that the standard library is designed to avoid implicit magic in decoding text. I just want to explicitly say "go ahead and guess".
+1 for the chardet module.
It is not in the standard library, but you can easily install it with pip: pip install chardet
See Installing Pip if you don't have pip.
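The example that originally accompanied this answer did not survive on this page, so the following is only a minimal sketch of how chardet might be used here. It assumes Python 3 (where the input is bytes and the result is str); guess_decode and its default parameter are hypothetical names for illustration, and the fallback for when chardet is not installed is an addition of this sketch, not part of chardet.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # assumption: fall back to a fixed encoding below


def guess_decode(data, default="latin-1"):
    """Decode bytes with chardet's best guess, without ever raising.

    guess_decode and `default` are hypothetical names for this sketch.
    """
    encoding = None
    if chardet is not None:
        # detect() returns a dict; its "encoding" value may be None
        encoding = chardet.detect(data)["encoding"]
    try:
        return data.decode(encoding or default, errors="replace")
    except LookupError:
        # The guessed codec name is unknown to this Python build.
        return data.decode(default, errors="replace")


print(guess_decode(b"Subject: hello"))
```

Because errors="replace" is always passed, undecodable bytes become U+FFFD instead of raising UnicodeDecodeError, whatever chardet guesses.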
As far as I can tell, the standard library doesn't have such a function, though it's not too difficult to write one as suggested in another answer. I think what I was really looking for was a way to decode a string with a guarantee that it wouldn't throw an exception. The errors parameter to str.decode does that, for example errors='replace' or errors='ignore'.
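A short illustration of the errors parameter, written for Python 3, where the method lives on bytes rather than on the Python 2 str type discussed in the question. The sample header bytes are made up for the example.

```python
# A header that claims to be us-ascii but contains a stray 0xE9 byte.
raw = b"caf\xe9"

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("ascii", errors="replace")

# errors="ignore" silently drops undecodable bytes.
ignored = raw.decode("ascii", errors="ignore")

print(repr(replaced))  # 'caf\ufffd'
print(repr(ignored))   # 'caf'
```

Neither call can raise UnicodeDecodeError, at the cost of losing the original byte values.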
The best way to do this that I've found is to iteratively try decoding the byte string with each of the most common encodings inside a try/except block.
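A minimal sketch of that approach in Python 3. The function name decode_with_fallbacks and the particular list of encodings are assumptions chosen for illustration; the answer does not say which encodings to try or in what order.

```python
def decode_with_fallbacks(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly.

    The name and the default encoding list are hypothetical choices.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Only reached if the caller's list has no catch-all such as latin-1.
    return data.decode("ascii", errors="replace"), None


print(decode_with_fallbacks(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_with_fallbacks(b"caf\xe9"))      # invalid UTF-8, valid cp1252
```

Order matters: strict encodings such as UTF-8 should come first, because latin-1 maps every byte to a character and therefore never fails, even when the result is not what the sender intended.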