chardet apparently wrong on Big5
I'm decoding a large (about a gigabyte) flat-file database that mixes character encodings willy-nilly. The Python module chardet has done a good job of identifying the encodings so far, but I've hit a stumbling block...
In [428]: badish[-3]
Out[428]: '\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'
In [429]: chardet.detect(badish[-3])
Out[429]: {'confidence': 0.98999999999999999, 'encoding': 'Big5'}
In [430]: unicode(badish[-3], 'Big5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~/src/imdb/<ipython console> in <module>()
UnicodeDecodeError: 'big5' codec can't decode bytes in position 11-12: illegal multibyte sequence
chardet reports very high confidence in its choice of encoding, but the string doesn't decode with it... Are there any other sensible approaches?
Comments (1)
A point that can't be stressed too strongly: You should not expect any reasonable encoding guess from a piece of text that is so short and has such a high percentage of plain old ASCII characters in it.
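To put a number on "so short" and "such a high percentage of plain old ASCII", here is a minimal sketch that counts the bytes in the sample line from the question. It is written for Python 3 (the session in the question is Python 2), and the variable names are only illustrative.

```python
# The sample line from the question, as a Python 3 bytes literal.
raw = b'\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'

# Bytes >= 0x80 are the only ones that carry any encoding evidence.
non_ascii = [b for b in raw if b >= 0x80]

total = len(raw)
print("total bytes:     ", total)
print("non-ASCII bytes: ", len(non_ascii), [hex(b) for b in non_ascii])
print("non-ASCII share:  %.1f%%" % (100.0 * len(non_ascii) / total))
```

Only three bytes in the whole line say anything about the encoding, which is very little for a statistical detector to work with.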
About big5: chardet casts a very wide net when checking CJK encodings. There are lots of unused slots in big5, and chardet doesn't exclude them. That string is not valid big5, as you have found out. It is in fact valid (but meaningless) big5_hkscs (which used a lot of the holes in big5).
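A quick way to check a guess like this is to attempt the decode and see whether it survives. The sketch below assumes Python 3; `try_decode` is a hypothetical helper, not part of chardet. Whether the `big5hkscs` decode succeeds may depend on the codec tables shipped with your Python version, so the code reports the outcome rather than assuming it.

```python
raw = b'\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'

def try_decode(data, codec):
    """Return the decoded text, or None if the bytes are invalid in codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

big5_result = try_decode(raw, "big5")        # plain Big5, as in the question
hkscs_result = try_decode(raw, "big5hkscs")  # Big5 with the HKSCS extensions

for codec, result in (("big5", big5_result), ("big5hkscs", hkscs_result)):
    status = "invalid" if result is None else "decodes to " + ascii(result)
    print(codec, "->", status)
```

A decode that succeeds only proves the bytes are legal in that codec, not that the result means anything.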
There are an enormous number of single-byte encodings that fit the string.
At this stage it's necessary to seek out-of-band help. Googling "Kuzey etc" drags up a Turkish TV series "Kuzey rüzgari" so we now have the language.
That means that if it was entered by a person familiar with Turkish, it could be in cp1254, or iso_8859_3 (or _9), or mac_turkish. All of those produce gibberish for the [Kaz??m] word near the end. According to the imdb website, that's the name of a character, and it's the same gibberish as obtained by decoding with cp1254 and iso-8859-9 (KazÄ±m). Decoding with your suggested iso-8859-2 gives KazÄąm, which doesn't look very plausible either.
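The comparison can be reproduced by decoding the character name with each candidate codec. This is a sketch for Python 3 using the standard library codec names. The candidate list comes from the paragraph above, and `errors="replace"` is there only so that an unmapped byte in some codec cannot abort the loop.

```python
raw_name = b"Kaz\xc4\xb1m"  # the bracketed word from the sample line

# Single-byte encodings plausible for Turkish text, plus the suggested iso-8859-2.
candidates = ["cp1254", "iso8859_3", "iso8859_9", "mac_turkish", "iso8859_2"]

decoded = {codec: raw_name.decode(codec, errors="replace") for codec in candidates}

for codec, text in decoded.items():
    # ascii() keeps the output printable on any terminal encoding.
    print("%-12s %s" % (codec, ascii(text)))
```

None of the results reads as a plausible name, which suggests these two bytes were not written in a single-byte encoding at all.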
Can you generalise this? I don't think so :-)
I would strongly suggest that in such a case you decode it using latin1 (so that no bytes are mangled) and flag the record as having unknown encoding. You should use a minimum length cutoff as well.
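One way this advice might look in code is sketched below, assuming Python 3. `decode_record` and its thresholds (`min_length=64`, `min_confidence=0.9`) are hypothetical and would need tuning on real data. The `detect` argument is any callable that returns a dict with `'encoding'` and `'confidence'` keys, as `chardet.detect` does in the question.

```python
def decode_record(raw, detect, min_length=64, min_confidence=0.9):
    """Decode one record; return (text, encoding_used, trusted).

    trusted is False when the record was too short, the guess was weak,
    or the guessed codec could not decode the bytes. In that case the
    text is a lossless latin-1 decoding to be revisited later.
    """
    if len(raw) >= min_length:  # minimum length cutoff (assumed value)
        guess = detect(raw)
        encoding = guess.get("encoding")
        if encoding and guess.get("confidence", 0.0) >= min_confidence:
            try:
                return raw.decode(encoding), encoding, True
            except (UnicodeDecodeError, LookupError):
                pass  # confident but wrong guess, as with Big5 in the question
    # latin-1 maps every byte to one code point, so nothing is lost or mangled.
    return raw.decode("latin-1"), "latin-1", False


# Usage with a stand-in detector that mimics the result from the question.
def fake_detect(data):
    return {"encoding": "Big5", "confidence": 0.99}

raw = b'\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'
text, used, trusted = decode_record(raw, fake_detect, min_length=0)
print(used, trusted)
```

Because the latin-1 decoding is reversible, `text.encode("latin-1")` gives back the original bytes whenever better information about the record turns up.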
Update: For what it's worth, the_two_bytes_in_the_character_name.decode('utf8') produces U+0131 LATIN SMALL LETTER DOTLESS I, which is used in Turkish and Azerbaijani. Further googling indicates that Kazım is a common-enough Turkish given name.
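The UTF-8 check can be reproduced with the standard library. This is a sketch for Python 3, and the variable names are illustrative. It also shows that the line as a whole is not valid UTF-8, because the lone `\xfc` byte in "r\xfczgari" cannot start a UTF-8 sequence.

```python
import unicodedata

two_bytes = b"\xc4\xb1"  # the two non-ASCII bytes in the character name
ch = two_bytes.decode("utf-8")
name = unicodedata.name(ch)
print("U+%04X %s" % (ord(ch), name))

# The same test on the whole line fails at the \xfc byte.
raw = b'\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'
try:
    raw.decode("utf-8")
    whole_line_is_utf8 = True
except UnicodeDecodeError:
    whole_line_is_utf8 = False
print("whole line valid UTF-8:", whole_line_is_utf8)
```

So this one record appears to mix encodings: the name fits UTF-8 while the title does not.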