Best way to decode an unknown unicode encoding in Python 2.5

Posted on 2024-08-10 20:04:28

Have I got this all the right way round? Anyway, I am parsing a lot of HTML, but I don't always know what encoding it's meant to be (a surprising number of pages lie about it). The code below shows what I've been doing so far, but I'm sure there's a better way. Your suggestions would be much appreciated.

import logging

class UnicodingError(Exception):
    pass

# these encodings should be in most likely order to save time
encodings = [ "ascii", "utf_8", "big5", "big5hkscs", "cp037", "cp424", "cp437", "cp500", "cp737", "cp775", "cp850", "cp852", "cp855", 
    "cp856", "cp857", "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875", "cp932", "cp949", 
    "cp950", "cp1006", "cp1026", "cp1140", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258", 
    "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2", 
    "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", 
    "iso8859_6", "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15", "johab", "koi8_r", "koi8_u", 
    "mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", 
    "shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7", "utf_8_sig" ]

def to_unicode(string):
    '''Decode a byte string by trying each encoding in turn.'''
    for enc in encodings:
        try:
            logging.debug("unicoder is trying " + enc + " encoding")
            decoded = unicode(string, enc)   # the builtin unicode(), not this function
            logging.info("unicoder is using " + enc + " encoding")
            return decoded
        except UnicodeError:
            pass
    raise UnicodingError("still don't recognise the encoding after trying to guess")


Comments (3)

盛装女皇 2024-08-17 20:04:28

There are two general-purpose libraries for detecting unknown encodings:

chardet is supposed to be a port of the way Firefox does it.
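
A minimal sketch of using it, assuming chardet is installed (the filename is just illustrative; chardet.detect() returns a dict with 'encoding' and 'confidence' keys):

import chardet

raw = open("page.html", "rb").read()    # bytes of unknown encoding
guess = chardet.detect(raw)             # e.g. {'encoding': 'windows-1252', 'confidence': 0.87}
if guess["encoding"] is not None:
    text = raw.decode(guess["encoding"])
else:
    raise ValueError("chardet could not guess an encoding")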

You can use the following regex to detect utf8 from byte strings:

import re

utf8_detector = re.compile(r"""^(?:
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$""", re.X)

In practice, if you're dealing with English, I've found the following works 99.9% of the time (a sketch tying these rules together follows the list):

  1. if it passes the above regex, it's ASCII or UTF-8
  2. if it contains any bytes from 0x80-0x9f but not 0xa4, it's Windows-1252
  3. if it contains 0xa4, assume it's latin-15
  4. otherwise assume it's latin-1
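
A minimal sketch of those four rules, reusing the utf8_detector regex above (the function name guess_encoding is just illustrative, and "latin-15" is taken to mean ISO-8859-15):

def guess_encoding(data):
    '''Map a byte string to a codec name using the rules above.'''
    if utf8_detector.match(data):
        return "utf_8"          # rule 1: plain ASCII matches this regex too
    has_c1_bytes = any("\x80" <= ch <= "\x9f" for ch in data)
    if has_c1_bytes and "\xa4" not in data:
        return "cp1252"         # rule 2: Windows-1252 assigns characters to 0x80-0x9f
    if "\xa4" in data:
        return "iso8859_15"     # rule 3: 0xa4 is the euro sign in ISO-8859-15
    return "latin_1"            # rule 4: fall back to Latin-1

# usage: text = data.decode(guess_encoding(data))
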
找个人就嫁了吧 2024-08-17 20:04:28

I've tackled the same problem and found that there's no way to determine content's encoding without metadata about it. That's why I ended up with the same approach you're trying here.

My only additional piece of advice is that, rather than ordering the list of possible encodings by likelihood, you should order it by specificity. I've found that certain character sets are subsets of others, so if you check utf_8 as your second choice you'll never find the encodings that are subsets of utf_8 (I think one of the Korean character sets uses the same number space as UTF).
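
The least specific codecs should therefore come last. latin_1, for instance, maps every possible byte to some character, so it will happily "decode" text that is really UTF-8; a quick snippet showing the pitfall:

data = u"caf\u00e9".encode("utf_8")    # the UTF-8 bytes 'caf\xc3\xa9'
print repr(data.decode("latin_1"))     # u'caf\xc3\xa9' -- no error, but mojibake
print repr(data.decode("utf_8"))       # u'caf\xe9'     -- the intended text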

So尛奶瓶 2024-08-17 20:04:28

Since you are using Python, you might try UnicodeDammit. It is part of Beautiful Soup, which you may also find useful.

As the name suggests, UnicodeDammit will try to do whatever it takes to get proper unicode out of the crap you may find in the world.
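
A minimal sketch assuming the Beautiful Soup 3 series (the Python 2-era release; the newer bs4 package renames these attributes to unicode_markup and original_encoding):

from BeautifulSoup import UnicodeDammit    # Beautiful Soup 3.x

raw = open("page.html", "rb").read()       # bytes of unknown encoding (illustrative filename)
dammit = UnicodeDammit(raw)
print dammit.originalEncoding              # the codec it settled on, e.g. 'utf-8'
text = dammit.unicode                      # the decoded text, or None if nothing worked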
