Python - Python 3.1 can't seem to handle UTF-16 encoded files?

Posted on 2024-10-31 19:05:54

I'm trying to run some code to simply go through a bunch of files and write those that happen to be .txt files into the same file, removing all the spaces. Here's some simple code that should do the trick:

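import os

# rootdir and output_file are assumed to be defined earlier in the script
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if '.txt' in file:
            # No encoding is given, so the platform default (UTF-8 here) is used
            f = open(subdir+'/'+file, 'r')
            line = f.readline()
            while line:
                # split() drops leading/trailing whitespace and collapses runs
                line2 = line.split()
                if line2:
                    output_file.write(" ".join(line2)+'\n')
                line = f.readline()
            f.close()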

But instead, I get the following error:

File "/usr/lib/python3.1/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 0: unexpected code byte

It turns out these .txt files are all in UTF-16 (according to Firefox, at any rate). I thought Python 3.x was supposed to be able to handle any sort of character encoding?

Best,
Georgina

只是一片海 2024-11-07 19:05:54

Use open(bla, 'r', encoding="utf-16").
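
A minimal sketch of how that change could fit into the loop from the question. The function name merge_txt_files and its parameters are hypothetical, not from the original post, and writing the output as UTF-8 is an assumption; adjust it to whatever the merged file should use.

```python
import os

def merge_txt_files(rootdir, output_path, encoding="utf-16"):
    """Copy every .txt file under rootdir into one file, collapsing whitespace.

    Hypothetical helper: output_path should live outside rootdir (or not end
    in .txt), otherwise the walk would try to read the output file as input.
    """
    with open(output_path, "w", encoding="utf-8") as output_file:
        for subdir, dirs, files in os.walk(rootdir):
            for name in sorted(files):
                if not name.endswith(".txt"):
                    continue
                path = os.path.join(subdir, name)
                # Passing encoding explicitly avoids the platform default (UTF-8)
                with open(path, "r", encoding=encoding) as f:
                    for line in f:
                        words = line.split()
                        if words:  # skip lines that contain only whitespace
                            output_file.write(" ".join(words) + "\n")
```

The with blocks close the files even if decoding fails partway through, which the manual f.close() in the question would not do.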

明媚如初 2024-11-07 19:05:54

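There are various UTF-16 encodings.

utf-16-be big endian, no BOM
utf-16-le little endian, no BOM
utf-16 BOM-prefixed; encoding uses the platform's native byte order (little endian in the examples below), decoding follows the BOM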

Examples:

Python 3.2 (r32:88452, Feb 20 2011, 11:12:31) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 'a'.encode('utf-16')
>>> a
b'\xff\xfea\x00'
>>> a.decode('utf-16')
'a'
>>> a = 'a'.encode('utf-16-le')
>>> a
b'a\x00'
>>> a.decode('utf-16-le')
'a'
>>> a = 'a'.encode('utf-16-be')
>>> a
b'\x00a'
>>> a.decode('utf-16-be')
'a'

You can pass any of these names as the encoding argument to open(), as suggested in @filmor's answer.
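
The byte 0xfe at position 0 in the question's traceback is consistent with a big-endian BOM (FE FF), although the post does not show the second byte. As a rough sketch, the BOM can be inspected with the constants in the standard codecs module. The helper name sniff_utf16 is hypothetical, and the check is only a heuristic (a UTF-32 little-endian BOM also begins with FF FE).

```python
import codecs

def sniff_utf16(data):
    """Return a codec name guessed from a leading UTF-16 BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    return None

# Big-endian data with a BOM, like the files in the question appear to be
sample = codecs.BOM_UTF16_BE + "a".encode("utf-16-be")  # b'\xfe\xff\x00a'
print(sniff_utf16(sample))      # utf-16-be
print(sample.decode("utf-16"))  # a
```

In practice the plain "utf-16" codec is usually enough for files that carry a BOM, since it reads the BOM, picks the byte order, and strips the mark. Decoding the same bytes with "utf-16-be" would leave the BOM in the result as a leading '\ufeff' character.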
