Best way to decode an unknown unicode encoding in Python 2.5

Posted on 2024-08-10 20:04:28

Have I got this all the right way round? Anyway, I am parsing a lot of HTML, but I don't always know what encoding it's meant to be (a surprising number of pages lie about it). The code below shows what I've been doing so far, but I'm sure there's a better way. Your suggestions would be much appreciated.

import logging

class UnicodingError(Exception):
    pass

# these encodings should be in most likely order to save time
encodings = [ "ascii", "utf_8", "big5", "big5hkscs", "cp037", "cp424", "cp437", "cp500", "cp737", "cp775", "cp850", "cp852", "cp855", 
    "cp856", "cp857", "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875", "cp932", "cp949", 
    "cp950", "cp1006", "cp1026", "cp1140", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258", 
    "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2", 
    "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", 
    "iso8859_6", "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15", "johab", "koi8_r", "koi8_u", 
    "mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", 
    "shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7", "utf_8_sig" ]

def to_unicode(string):
    '''Decode a byte string by trying each encoding in turn.'''
    for enc in encodings:
        try:
            logging.debug("unicoder is trying " + enc + " encoding")
            decoded = unicode(string, enc)   # the builtin unicode(), not this function
            logging.info("unicoder is using " + enc + " encoding")
            return decoded
        except UnicodeError:
            pass
    raise UnicodingError("still don't recognise the encoding after trying to guess")


Comments (3)

盛装女皇 2024-08-17 20:04:28

There are two general-purpose libraries for detecting unknown encodings:

chardet is supposed to be a port of the way Firefox does it.
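
A minimal sketch of using it, assuming chardet is installed (the filename is just illustrative; chardet.detect() returns a dict with 'encoding' and 'confidence' keys):

import chardet

raw = open("page.html", "rb").read()    # bytes of unknown encoding
guess = chardet.detect(raw)             # e.g. {'encoding': 'windows-1252', 'confidence': 0.87}
if guess["encoding"] is not None:
    text = raw.decode(guess["encoding"])
else:
    raise ValueError("chardet could not guess an encoding")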

You can use the following regex to detect utf8 from byte strings:

import re

utf8_detector = re.compile(r"""^(?:
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$""", re.X)

In practice, if you're dealing with English, I've found the following works 99.9% of the time (a sketch tying these rules together follows the list):

  1. if it passes the above regex, it's ASCII or UTF-8
  2. if it contains any bytes from 0x80-0x9f but not 0xa4, it's Windows-1252
  3. if it contains 0xa4, assume it's latin-15
  4. otherwise assume it's latin-1
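
A minimal sketch of those four rules, reusing the utf8_detector regex above (the function name guess_encoding is just illustrative, and "latin-15" is taken to mean ISO-8859-15):

def guess_encoding(data):
    '''Map a byte string to a codec name using the rules above.'''
    if utf8_detector.match(data):
        return "utf_8"          # rule 1: plain ASCII matches this regex too
    has_c1_bytes = any("\x80" <= ch <= "\x9f" for ch in data)
    if has_c1_bytes and "\xa4" not in data:
        return "cp1252"         # rule 2: Windows-1252 assigns characters to 0x80-0x9f
    if "\xa4" in data:
        return "iso8859_15"     # rule 3: 0xa4 is the euro sign in ISO-8859-15
    return "latin_1"            # rule 4: fall back to Latin-1

# usage: text = data.decode(guess_encoding(data))
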
找个人就嫁了吧 2024-08-17 20:04:28

I've tackled the same problem and found that there's no way to determine content's encoding without metadata about it. That's why I ended up with the same approach you're trying here.

My only additional piece of advice is that, rather than ordering the list of possible encodings by likelihood, you should order it by specificity. I've found that certain character sets are subsets of others, so if you check utf_8 as your second choice you'll never find the encodings that are subsets of utf_8 (I think one of the Korean character sets uses the same number space as UTF).
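
The least specific codecs should therefore come last. latin_1, for instance, maps every possible byte to some character, so it will happily "decode" text that is really UTF-8; a quick snippet showing the pitfall:

data = u"caf\u00e9".encode("utf_8")    # the UTF-8 bytes 'caf\xc3\xa9'
print repr(data.decode("latin_1"))     # u'caf\xc3\xa9' -- no error, but mojibake
print repr(data.decode("utf_8"))       # u'caf\xe9'     -- the intended text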

So尛奶瓶 2024-08-17 20:04:28

Since you are using Python, you might try UnicodeDammit. It is part of Beautiful Soup, which you may also find useful.

As the name suggests, UnicodeDammit will try to do whatever it takes to get proper unicode out of the crap you may find in the world.
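
A minimal sketch assuming the Beautiful Soup 3 series (the Python 2-era release; the newer bs4 package renames these attributes to unicode_markup and original_encoding):

from BeautifulSoup import UnicodeDammit    # Beautiful Soup 3.x

raw = open("page.html", "rb").read()       # bytes of unknown encoding (illustrative filename)
dammit = UnicodeDammit(raw)
print dammit.originalEncoding              # the codec it settled on, e.g. 'utf-8'
text = dammit.unicode                      # the decoded text, or None if nothing worked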
