Python - Python 3.1 can't seem to handle UTF-16 encoded files?

Posted on 2024-10-31 19:05:54

I'm trying to run some code to simply go through a bunch of files and write those that happen to be .txt files into the same file, removing all the spaces. Here's some simple code that should do the trick:

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if '.txt' in file:
            f = open(subdir+'/'+file, 'r')
            line = f.readline()
            while line:
                line2 = line.split()
                if line2:
                    output_file.write(" ".join(line2)+'\n')
                line = f.readline()
            f.close()

But instead, I get the following error:

  File "/usr/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 0: unexpected code byte

It turns out these .txt files are all in UTF-16 (according to Firefox, at any rate). I thought Python 3.x was supposed to be able to handle any sort of character encoding?

Best,
Georgina

Comments (2)

只是一片海 2024-11-07 19:05:54

Use open(bla, 'r', encoding="utf-16").
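Applied to the loop in the question, that suggestion might look like the sketch below. The function name merge_txt_files, its parameters, and the choice to write the output as UTF-8 are assumptions made for illustration; the original post does not show how rootdir or output_file are set up. It also matches files with endswith('.txt') rather than the question's '.txt' in file test.

```python
import os


def merge_txt_files(rootdir, output_path, encoding="utf-16"):
    """Copy every .txt file under rootdir into one file, collapsing whitespace.

    Hypothetical helper: the name, parameters, and UTF-8 output are assumptions.
    """
    output_abs = os.path.abspath(output_path)
    with open(output_path, "w", encoding="utf-8") as output_file:
        for subdir, dirs, files in os.walk(rootdir):
            for name in files:
                path = os.path.join(subdir, name)
                # Skip non-.txt files and the output file itself.
                if not name.endswith(".txt") or os.path.abspath(path) == output_abs:
                    continue
                # Passing the encoding explicitly avoids the default codec
                # (UTF-8 in the traceback) being applied to UTF-16 data.
                with open(path, "r", encoding=encoding) as f:
                    for line in f:
                        words = line.split()
                        if words:  # drop blank lines
                            output_file.write(" ".join(words) + "\n")
```

A call such as merge_txt_files(rootdir, "merged.out") would then read each file as UTF-16. If some of the files are in a different encoding, this sketch still raises UnicodeDecodeError (or UnicodeError) for them.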

明媚如初 2024-11-07 19:05:54

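There are several UTF-16 encodings:

  • utf-16-be: big endian, no BOM

  • utf-16-le: little endian, no BOM

  • utf-16: when encoding, platform-native byte order (little endian on the machine below) with a BOM prepended; when decoding, the BOM determines the byte order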

Examples:

Python 3.2 (r32:88452, Feb 20 2011, 11:12:31) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 'a'.encode('utf-16')
>>> a
b'\xff\xfea\x00'
>>> a.decode('utf-16')
'a'
>>> a = 'a'.encode('utf-16-le')
>>> a
b'a\x00'
>>> a.decode('utf-16-le')
'a'
>>> a = 'a'.encode('utf-16-be')
>>> a
b'\x00a'
>>> a.decode('utf-16-be')
'a'

You can pass any of these encodings to open(), as suggested in @filmor's answer.
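The byte 0xfe at position 0 in the question's traceback is consistent with a big-endian BOM (0xfe 0xff), although the post does not show the file contents, so that is an inference. The sketch below builds such bytes by hand (the variable names are illustrative) and shows how the generic utf-16 codec and an explicit-endian codec treat the BOM differently.

```python
import codecs

text = "a"

# UTF-16 bytes with an explicit BOM in each byte order.
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # starts with 0xfe 0xff
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # starts with 0xff 0xfe

# The generic codec reads the BOM, picks the byte order, and drops the BOM.
decoded_be = be_with_bom.decode("utf-16")
decoded_le = le_with_bom.decode("utf-16")

# An explicit-endian codec does not interpret the BOM, so it survives
# in the decoded text as the character U+FEFF.
decoded_explicit = be_with_bom.decode("utf-16-be")

print(repr(decoded_be), repr(decoded_le), repr(decoded_explicit))
```

For files that start with a BOM, plain "utf-16" is therefore usually the more convenient choice, since it handles either byte order and does not leave a stray U+FEFF at the start of the text.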
