Is there a Python library function that attempts to guess the character encoding of some bytes?

Posted 2024-07-08 07:52:24

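I'm writing some mail-processing software in Python, and it is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding. Still, I'd like to get a unicode string that approximates the original string without raising a UnicodeDecodeError.

So I'm looking for a function that takes a str and, optionally, some hints, and does its best to give me back a unicode. I could write one myself, of course, but if such a function already exists, its author has probably thought more deeply about the best way to go about this.

I also know that Python's design prefers explicit over implicit, and that the standard library is designed to avoid implicit magic when decoding text. I just want to say explicitly, "go ahead and guess".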

相思故 2024-07-15 07:52:24

+1 for the chardet module.

It is not in the standard library, but you can easily install it with the following command:

$ pip install chardet

Example:

>>> import urllib.request
>>> rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}

See the pip installation instructions if you don't have pip yet.
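To connect this back to the question, here is a minimal sketch of a "go ahead and guess" helper built on chardet.detect. The names guess_decode, hints and min_confidence are hypothetical and not part of chardet; the sketch assumes Python 3 bytes input and falls back to a lossy ASCII decode when no candidate encoding works.

```python
try:
    import chardet  # third-party; installed with `pip install chardet`
except ImportError:
    chardet = None  # the sketch still runs, but without any guessing


def guess_decode(data, hints=(), min_confidence=0.5):
    """Best-effort decode of bytes to str without raising UnicodeDecodeError.

    guess_decode, hints and min_confidence are hypothetical names chosen
    for this sketch; only chardet.detect() is a real chardet call.
    """
    # Caller-supplied hints are tried first, in order.
    candidates = list(hints)
    if chardet is not None:
        guess = chardet.detect(data)
        # 'encoding' can be None when chardet has no idea.
        if guess["encoding"] and guess["confidence"] >= min_confidence:
            candidates.append(guess["encoding"])
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep the ASCII characters, mark everything else with U+FFFD.
    return data.decode("ascii", "replace")
```

For a header that claims us-ascii, a call such as guess_decode(raw_header, hints=("ascii",)) tries the declared encoding first and only then consults the detector. The 0.5 confidence threshold is an arbitrary choice for illustration.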

烟花易冷人易散 2024-07-15 07:52:24

As far as I can tell, the standard library has no such function, though it is not too difficult to write one as suggested above. What I was really looking for was a way to decode a string with a guarantee that it would not throw an exception. The errors parameter to str.decode (bytes.decode in Python 3) does that.

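def decode(s, encodings=('ascii', 'utf8', 'latin1')):
    """Decode the bytes s with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            # Try the next candidate encoding.
            pass
    # Reached only if every candidate failed (never with 'latin1' in the
    # list, since it maps every byte); drop whatever is not ASCII.
    return s.decode('ascii', 'ignore')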
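As a small illustration of the errors parameter, the sketch below decodes the same bytes with three of the standard error handlers. It assumes Python 3, and the header value in raw is made up for the example.

```python
# Hypothetical header bytes: 0xE9 is not valid ASCII.
raw = b"Subject: caf\xe9 menu"

# 'ignore' silently drops every undecodable byte.
ignored = raw.decode("ascii", "ignore")

# 'replace' substitutes U+FFFD (the replacement character) for each one.
replaced = raw.decode("ascii", "replace")

# 'backslashreplace' keeps the byte value visible as an escape sequence
# (supported for decoding since Python 3.5).
escaped = raw.decode("ascii", "backslashreplace")

print(ignored)   # Subject: caf menu
print(replaced)  # Subject: caf\ufffd menu (shown as the replacement character)
print(escaped)   # Subject: caf\xe9 menu
```

Of the three, 'ignore' loses the most information, while 'backslashreplace' lets you recover the original byte later if the real encoding becomes known.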
