How can I read a file in Python that may be saved as either ANSI or Unicode?

I have to write a script that supports reading a file which can be saved as either Unicode or ANSI (using MS's Notepad).

I don't have any indication of the encoding format in the file; how can I support both encoding formats? (In other words, a generic way of reading files without knowing the format in advance.)


2 Answers

夜未央樱花落 2024-12-28 03:41:17

MS Notepad gives the user a choice of 4 encodings, expressed in clumsy, confusing terminology:

"Unicode" is UTF-16, written little-endian. "Unicode big endian" is UTF-16, written big-endian. In both UTF-16 cases, this means that the appropriate BOM will be written. Use utf-16 to decode such a file.

"UTF-8" is UTF-8; Notepad explicitly writes a "UTF-8 BOM". Use utf-8-sig to decode such a file.

"ANSI" is a shocker. This is MS terminology for "whatever the default legacy encoding is on this computer".

Here is a list of Windows encodings that I know of and the languages/scripts that they are used for:

cp874  Thai
cp932  Japanese 
cp936  Unified Chinese (P.R. China, Singapore)
cp949  Korean 
cp950  Traditional Chinese (Taiwan, Hong Kong, Macao(?))
cp1250 Central and Eastern Europe 
cp1251 Cyrillic (Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian)
cp1252 Western European languages
cp1253 Greek 
cp1254 Turkish 
cp1255 Hebrew 
cp1256 Arabic script
cp1257 Baltic languages 
cp1258 Vietnamese
cp???? languages/scripts of India  

If the file has been created on the computer where it is being read, then you can obtain the "ANSI" encoding by locale.getpreferredencoding(). Otherwise if you know where it came from, you can specify what encoding to use if it's not UTF-16. Failing that, guess.
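
A short sketch of querying that default (Python 3; the ctypes variant is a Windows-only illustration, not something this answer depends on):

import locale
print(locale.getpreferredencoding())  # e.g. 'cp1252' on Western European Windows

# Windows-only alternative: ask Win32 for the "ANSI" code page directly.
import ctypes
print('cp%d' % ctypes.windll.kernel32.GetACP())  # e.g. 'cp1252'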

Be careful using codecs.open() to read files on Windows. The docs say: """Note
Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.""" This means that your lines will end in \r\n and you will need/want to strip those off.
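
For example (Python 3 shown; the file name is hypothetical):

import codecs
with codecs.open('notes.txt', 'r', 'cp1252') as f:  # hypothetical file
    for line in f:
        print(repr(line))                 # lines still end in '\r\n' on Windows files
        print(repr(line.rstrip('\r\n')))  # strip the line terminator yourself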

Putting it all together:

Sample text file, saved with all 4 encoding choices, looks like this in Notepad:

The quick brown fox jumped over the lazy dogs.
àáâãäå

Here is some demo code:

import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    # Sniff the first 3 bytes for a BOM (Python 2: bytes and str are the same type).
    with open(filepath, 'rb') as f:
        data = f.read(3)
    if data[:2] in ('\xff\xfe', '\xfe\xff'):
        # UTF-16 BOM, either byte order; the utf-16 codec reads it itself.
        return 'utf-16'
    if data == u''.encode('utf-8-sig'):
        # UTF-8 BOM ('\xef\xbb\xbf'); utf-8-sig strips it on decode.
        return 'utf-8-sig'
    # presumably "ANSI"
    return default_ansi_encoding or locale.getpreferredencoding()

if __name__ == "__main__":
    import sys, glob, codecs
    defenc = sys.argv[1]  # fallback "ANSI" codec; "" means use the locale default
    for fpath in glob.glob(sys.argv[2]):
        print
        print (fpath, defenc)
        with open(fpath, 'rb') as f:
            print "raw:", repr(f.read())
        enc = guess_notepad_encoding(fpath, defenc)
        print "guessed encoding:", enc
        with codecs.open(fpath, 'r', enc) as f:
            for lino, line in enumerate(f, 1):
                print lino, repr(line)
                print lino, repr(line.rstrip('\r\n'))

and here is the output when run in a Windows "Command Prompt" window using the command \python27\python read_notepad.py "" t1-*.txt

('t1-ansi.txt', '')
raw: 'The quick brown fox jumped over the lazy dogs.\r\n\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
guessed encoding: cp1252
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-u8.txt', '')
raw: '\xef\xbb\xbfThe quick brown fox jumped over the lazy dogs.\r\n\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\r\n'
guessed encoding: utf-8-sig
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-uc.txt', '')
raw: '\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\x00e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\x00.\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n\x00'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-ucb.txt', '')
raw: '\xfe\xff\x00T\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\x00e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\x00.\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
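
The demo above is Python 2.7 (note the print statements and the u'...' reprs). A rough Python 3 port of the same idea, untested against these exact transcripts, might look like:

import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    # Same BOM sniff as above, on bytes literals.
    with open(filepath, 'rb') as f:
        data = f.read(3)
    if data[:2] in (b'\xff\xfe', b'\xfe\xff'):
        return 'utf-16'
    if data == b'\xef\xbb\xbf':
        return 'utf-8-sig'
    # presumably "ANSI"
    return default_ansi_encoding or locale.getpreferredencoding()

# In Python 3, plain open() can decode; newline='' keeps '\r\n' intact
# (mirroring the codecs.open() behaviour discussed above).
enc = guess_notepad_encoding('t1-ansi.txt', 'cp1252')
with open('t1-ansi.txt', 'r', encoding=enc, newline='') as f:
    for lino, line in enumerate(f, 1):
        print(lino, repr(line.rstrip('\r\n')))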

Things to be aware of:

(1) "mbcs" is a file-system pseudo-encoding which has no relevance at all to decoding the contents of files. On a system where the default encoding is cp1252, it makes like latin1 (aarrgghh!!); see below

>>> all_bytes = "".join(map(chr, range(256)))
>>> u1 = all_bytes.decode('cp1252', 'replace')
>>> u2 = all_bytes.decode('mbcs', 'replace')
>>> u1 == u2
False
>>> [(i, u1[i], u2[i]) for i in xrange(256) if u1[i] != u2[i]]
[(129, u'\ufffd', u'\x81'), (141, u'\ufffd', u'\x8d'), (143, u'\ufffd', u'\x8f'), (144, u'\ufffd', u'\x90'), (157, u'\ufffd', u'\x9d')]
>>>

(2) chardet is very good at detecting encodings based on non-Latin scripts (Chinese/Japanese/Korean, Cyrillic, Hebrew, Greek) but not much good at Latin-based encodings (Western/Central/Eastern Europe, Turkish, Vietnamese) and doesn't grok Arabic at all.
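
If you do want a guess for BOM-less files, here is a sketch of using chardet as a fallback (assuming it is installed, e.g. via pip install chardet; the file name and the cp1252 fallback are assumptions):

import chardet

with open('mystery.txt', 'rb') as f:  # hypothetical file
    raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1251', 'confidence': 0.98, ...}
text = raw.decode(guess['encoding'] or 'cp1252', errors='replace')  # assumed fallback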

も让我眼熟你 2024-12-28 03:41:17

Notepad saves Unicode files with a byte order mark. This means that the first bytes of the file will be:

  • EF BB BF -- UTF-8
  • FF FE -- "Unicode" (actually UTF-16 little-endian, looks like)
  • FE FF -- "Unicode big-endian" (looks like UTF-16 big-endian)

Other text editors may or may not have the same behavior, but if you know for sure Notepad is being used, this will give you a decent heuristic for auto-selecting the encoding. All these sequences are valid in the ANSI encoding as well, however, so it is possible for this heuristic to make mistakes. It is not possible to guarantee that the correct encoding is used.
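
A minimal sketch of that heuristic (Python 3; returning None just means "no BOM, fall back to whatever 'ANSI' codec you choose", see the other answer):

def sniff_bom(raw):
    # Map Notepad's BOMs to Python codec names; None means "presumably ANSI".
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'
    if raw.startswith((b'\xff\xfe', b'\xfe\xff')):
        return 'utf-16'  # the codec reads the BOM to pick the byte order
    return None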
