Python file line iteration and strange characters

Posted 2024-08-30 09:00:03

I have a huge gzipped text file which I need to read, line by line. I go with the following:

for i, line in enumerate(codecs.getreader('utf-8')(gzip.open('file.gz'))):
  print i, line

At some point late in the file, the python output diverges from the file. This is because lines are getting broken due to weird special characters that python thinks are newlines. When I open the file in 'vim', they are correct, but the suspect characters are formatted weirdly. Is there something I can do to fix this?

I've tried other codecs including utf-16, latin-1. I've also tried with no codec.

I looked at the file using 'od'. Sure enough, there are \n characters where they shouldn't be. But the "wrong" ones are preceded by a weird character. I think there's some encoding here in which some characters are 2 bytes long, with the trailing byte looking like a \n if it is not interpreted properly.

According to 'od -h file' the offending character is '1d1c'.

If I replace:

gzip.open('file.gz')

With:

os.popen('zcat file.gz')

It works fine (and is actually quite a bit faster). But I'd like to know where I'm going wrong.


Comments (2)

一百个冬季 2024-09-06 09:00:03

Try again with no codec. The following reproduces your problem when a codec is used, and shows that the problem goes away without one:

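import gzip
import os
import codecs

# Python 2: write a small gzip file with \x1d and \x1c ahead of the real newline
data = gzip.open("file.gz", "wb")
data.write('foo\x1d\x1cbar\nbaz')
data.close()

# Decoded through a codec reader: lines are also split at \x1d and \x1c
print list(codecs.getreader('utf-8')(gzip.open('file.gz')))
# Plain byte streams: lines are split only at \n
print list(os.popen('zcat file.gz'))
print list(gzip.open('file.gz'))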

Outputs:

[u'foo\x1d', u'\x1c', u'bar\n', u'baz']
['foo\x1d\x1cbar\n', 'baz']
['foo\x1d\x1cbar\n', 'baz']
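
For readers on Python 3, here is a rough equivalent of the reproduction above. It is only a sketch: the helper name read_lines_three_ways and the temporary file are made up for illustration, and using gzip.open in text mode with newline="\n" is one possible way to decode while splitting only at \n, not something taken from the original answer.

```python
import codecs
import gzip
import os
import tempfile


def read_lines_three_ways(payload=b"foo\x1d\x1cbar\nbaz"):
    """Write payload to a temporary .gz file and read it back line by line
    in three different ways (hypothetical helper, for illustration only)."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        with gzip.open(path, "wb") as f:
            f.write(payload)

        # 1. codecs stream reader: splits decoded text with str.splitlines(),
        #    so \x1d and \x1c also end a line.
        with gzip.open(path, "rb") as f:
            via_codecs = list(codecs.getreader("utf-8")(f))

        # 2. Raw bytes: lines end only at b"\n".
        with gzip.open(path, "rb") as f:
            raw = list(f)

        # 3. Text mode with newline="\n": decoded text, lines end only at "\n".
        with gzip.open(path, "rt", encoding="utf-8", newline="\n") as f:
            via_text = list(f)
    finally:
        os.remove(path)
    return via_codecs, raw, via_text


if __name__ == "__main__":
    for result in read_lines_three_ways():
        print(result)
```

The first list is broken at the separator characters, while the other two keep 'foo\x1d\x1cbar\n' together as one line.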

电影里的梦 2024-09-06 09:00:03

I asked (in a comment) """Show us the output from print repr(weird_special_characters). When you open the file in vim, WHAT are correct? Please be more precise than "formatted weirdly".""" But nothing :-(

What file are you looking at with od? file.gz?? If you can see anything recognisable in there, it's not a gzip file! You're not seeing newlines, you're seeing binary bytes that contain 0x0A.

If the original file was utf-8 encoded, what was the point of trying it with other codecs?

Does "works OK with zcat" mean that you got recognisable data without a utf8 decode step??

I suggest that you simplify your code, and do it a step at a time ... see for example the accepted answer to this question. Try it again and please show the exact code that you ran, and use repr() when describing the results.

Update: It looks like DS has guessed what you were trying to explain about the \x1c and \x1d.

Here are some notes on WHY it happens like that:

In ASCII, only \r and \n are considered when line-breaking:

>>> import pprint
>>> text = ''.join('A' + chr(i) for i in range(32)) + 'BBB'
>>> print repr(text)
'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
['A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 'A\x0bA\x0cA\r', # line break
 'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB']
>>>

However, in Unicode, the characters \x1C (FILE SEPARATOR), \x1D (GROUP SEPARATOR), and \x1E (RECORD SEPARATOR) also qualify as line endings:

>>> text = u''.join('A' + unichr(i) for i in range(32)) + u'BBB'
>>> print repr(text)
u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
[u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 u'A\x0bA\x0cA\r', # line break
 u'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1c', # line break
 u'A\x1d', # line break
 u'A\x1e', # line break
 u'A\x1fBBB']
>>>

This will happen whatever codec you use. You still need to work out what (if any) codec you need to use. You also need to work out whether the original file was really a text file and not a binary file. If it's a text file, you need to consider the meaning of the \x1c and \x1d in the file.
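
The same check can be run on Python 3, where str plays the role of unicode and bytes plays the role of the old 8-bit str. This is a small sketch with made-up helper names (str_line_boundaries, bytes_line_boundaries). Note that on Python 3 the set for text is slightly larger than in the Python 2 session above, because \x0b and \x0c are also treated as line boundaries there.

```python
def str_line_boundaries(limit=32):
    """Code points below limit at which str.splitlines() breaks a line."""
    return [i for i in range(limit)
            if len(("A" + chr(i) + "B").splitlines()) == 2]


def bytes_line_boundaries(limit=32):
    """Byte values below limit at which bytes.splitlines() breaks a line."""
    return [i for i in range(limit)
            if len((b"A" + bytes([i]) + b"B").splitlines()) == 2]


if __name__ == "__main__":
    print([hex(i) for i in str_line_boundaries()])
    print([hex(i) for i in bytes_line_boundaries()])
    # Splitting decoded text on "\n" explicitly keeps the separators in place.
    print("foo\x1d\x1cbar\nbaz".split("\n"))
```

If the \x1c and \x1d characters are meaningful field or group separators in your data, splitting on "\n" explicitly (or iterating over the bytes and decoding each line afterwards) keeps them inside the line where they belong.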
